arXiv:1503.03703v7 [math.OC] 3 Aug 2016 


Activity Identification and Local Linear Convergence 
of Forward-Backward-type methods

Jingwei Liang, Jalal M. Fadili, Gabriel Peyré


Abstract. In this paper, we consider a class of Forward-Backward (FB) splitting methods that includes several 
variants (e.g. inertial schemes, FISTA) for minimizing the sum of two proper convex and lower semi-continuous 
functions, one of which has a Lipschitz continuous gradient, and the other is partly smooth relative to a smooth active 
manifold M. We propose a unified framework, under which we show that this class of FB-type algorithms (i) correctly
identifies the active manifolds in a finite number of iterations (finite activity identification), and (ii) then enters a local 
linear convergence regime, which we characterize precisely in terms of the structure of the underlying active manifolds. 
For simpler problems involving polyhedral functions, we show finite termination. We also establish and explain why 
FISTA (with convergent sequences) locally oscillates and can be slower than FB. These results may have numerous 
applications including in signal/image processing, sparse recovery and machine learning. Indeed, the obtained results 
explain the typical behaviour that has been observed numerically for many problems in these fields such as the Lasso, 
the group Lasso, the fused Lasso and the nuclear norm minimization to name only a few.

AMS subject classifications. 49J52, 65K05, 65K10, 90C25, 90C31


1 Introduction 

1.1 Non-smooth optimization

In various fields of science and engineering, such as signal/image processing, inverse problems and machine 
learning, many problems can be cast as solving a structured composite non-smooth optimization problem of 
the sum of two functions, which usually reads 

$$\min_{x \in \mathbb{R}^n} \Phi(x) = F(x) + R(x), \tag{$\mathcal{P}_{\mathrm{opt}}$}$$

where

(H.1) $R \in \Gamma_0(\mathbb{R}^n)$, the set of proper convex and lower semi-continuous (lsc) functions on $\mathbb{R}^n$;

(H.2) $F \in C^{1,1}(\mathbb{R}^n)$ is convex, and the gradient $\nabla F$ is $(1/\beta)$-Lipschitz continuous;

(H.3) $\mathrm{Argmin}(\Phi) \neq \emptyset$, i.e. the set of minimizers is non-empty.

From now on, we suppose that assumptions (H.1)-(H.3) hold. Problem $(\mathcal{P}_{\mathrm{opt}})$ is closely related to finding
solutions of the monotone inclusion problem

$$\text{Find } x \in \mathbb{R}^n \text{ such that } 0 \in A(x) + B(x), \tag{$\mathcal{P}_{\mathrm{inc}}$}$$

where we have

(H.4) $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is a set-valued maximal monotone operator (see (A.1));

(H.5) $B : \mathbb{R}^n \to \mathbb{R}^n$ is maximal monotone and $\beta$-cocoercive (see (A.2));

(H.6) $\mathrm{zer}(A + B) \neq \emptyset$, i.e. the set of zeros of $A + B$ is non-empty.

For problem $(\mathcal{P}_{\mathrm{opt}})$, given a global minimizer $x^\star \in \mathrm{Argmin}(\Phi)$, the corresponding first-order optimality
condition reads

$$0 \in \partial R(x^\star) + \nabla F(x^\star),$$

where $\partial R$ denotes the sub-differential of $R$ at $x^\star$ (see definition (1.5)). Clearly, if we let $A = \partial R$ and
$B = \nabla F$, then $(\mathcal{P}_{\mathrm{opt}})$ is simply a special case of $(\mathcal{P}_{\mathrm{inc}})$.

In this paper, our main focus is the non-smooth optimization problem $(\mathcal{P}_{\mathrm{opt}})$, though some of our results
are also valid for the monotone inclusion problem $(\mathcal{P}_{\mathrm{inc}})$, for instance the proposed Algorithm 1 and its global
convergence analysis; see Theorems 2.1 and 2.3 in Section 2.

1.2 Forward-Backward-type splitting methods

The Forward-Backward (FB) splitting method [40] is a powerful tool for solving optimization problems
$(\mathcal{P}_{\mathrm{opt}})$ with the additively separable and “smooth + non-smooth” structure. The standard (non-relaxed)
version of FB updates a new iterate $x_{k+1}$ based on the following rule ($x_0 \in \mathbb{R}^n$ is chosen arbitrarily),

$$x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k \nabla F(x_k)\big), \quad \gamma_k \in [\epsilon, 2\beta - \bar\epsilon], \tag{1.1}$$

where $\epsilon, \bar\epsilon > 0$, and $\mathrm{prox}_{\gamma R}$ denotes the proximity operator of $\gamma R$, which is defined as

$$\mathrm{prox}_{\gamma R}(\cdot) = \operatorname{argmin}_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - \cdot\|^2 + \gamma R(x).$$

The scheme (1.1) recovers the gradient descent method when $R = 0$, and the classic Proximal Point
Algorithm (PPA) [53] when $F = 0$. Global convergence of the sequence $(x_k)_{k \in \mathbb{N}}$ generated by the FB method
is well established in the literature, based on the property that the composed operator $\mathrm{prox}_{\gamma R}(\mathrm{Id} - \gamma \nabla F)$
is so-called averaged non-expansive [12]. Moreover, a sub-linear $O(1/k)$ convergence rate of the sequence of
objective values of FB is also established in e.g. [47, 16, 14].
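To make iteration (1.1) concrete, here is a minimal Python sketch on a toy problem. The problem is an assumption of this illustration and is not taken from the paper: $F(x) = \frac12\|x - y\|^2$ (so $\nabla F$ is 1-Lipschitz, i.e. $\beta = 1$) and $R = \lambda\|\cdot\|_1$, whose proximity operator is componentwise soft-thresholding. The names `soft_threshold` and `forward_backward` are hypothetical helpers.

```python
from typing import Callable, List

Vector = List[float]


def soft_threshold(v: Vector, t: float) -> Vector:
    """Proximity operator of t * ||.||_1: shrink each entry towards 0 by t."""
    return [(1.0 if vi > 0 else -1.0) * max(abs(vi) - t, 0.0) for vi in v]


def forward_backward(
    grad_F: Callable[[Vector], Vector],
    prox_R: Callable[[Vector, float], Vector],
    x0: Vector,
    gamma: float,
    n_iter: int,
) -> Vector:
    """Iterate x_{k+1} = prox_{gamma R}(x_k - gamma * grad F(x_k)) with a fixed step."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad_F(x)
        forward = [xi - gamma * gi for xi, gi in zip(x, g)]  # explicit (forward) step
        x = prox_R(forward, gamma)                           # implicit (backward) step
    return x


# Toy data, assumed for illustration only.
y = [3.0, -0.2, 0.5]
lam = 1.0


def grad_F(x: Vector) -> Vector:
    """Gradient of F(x) = 0.5 * ||x - y||^2."""
    return [xi - yi for xi, yi in zip(x, y)]


def prox_R(v: Vector, gamma: float) -> Vector:
    """Proximity operator of gamma * lam * ||.||_1."""
    return soft_threshold(v, gamma * lam)


x_fb = forward_backward(grad_F, prox_R, [0.0, 0.0, 0.0], gamma=1.0, n_iter=50)
```

For this separable toy problem the minimizer is the soft-thresholding of `y` at level `lam`, that is (2, 0, 0); the two small entries of `y` are set exactly to zero, which is the kind of activity the later sections study.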

Inertial schemes and FISTA In the literature, different variants of the FB method were studied, and a
popular trend is the inertial schemes which aim to speed up the convergence of FB. In [51], a two-step
algorithm called the “heavy-ball with friction” method is studied for solving $(\mathcal{P}_{\mathrm{opt}})$ with $R = 0$. It can
be seen as an explicit discretization of a nonlinear second-order dynamical system (oscillator with viscous
damping). This dynamical approach to iterative methods in optimization has motivated increasing attention
in recent years. For instance, in real Hilbert spaces, it is used in [4] for solving $(\mathcal{P}_{\mathrm{opt}})$ with $F = 0$ and [5]
for solving $(\mathcal{P}_{\mathrm{inc}})$ with $B = 0$, yielding an inertial PPA method. The authors in [44, 8, 41] propose different
inertial versions of the FB method for solving $(\mathcal{P}_{\mathrm{opt}})$ and/or $(\mathcal{P}_{\mathrm{inc}})$ in real Hilbert spaces.

On the other hand, in the context of convex optimization, the accelerated FISTA method was proposed in
[14], based upon the seminal work of [45], which achieves an $O(1/k^2)$ convergence rate for the sequence of
objective values. However, while iterates generated by FB are convergent, the convergence of FISTA
iterates has remained a long-standing open problem. This question was recently settled in [19], followed
by [9] in the continuous dynamical system case. More precisely, for $\gamma \in ]0, \beta]$ and a sequence of inertial
parameters that converges at an appropriate rate (i.e. in Algorithm 1 below, set $a_k = b_k = \frac{k-1}{k+q}$, $q > 2$),
these authors have established (weak in infinite-dimensional Hilbert spaces) convergence of the iterates
sequence while maintaining the $O(1/k^2)$ rate on the objective values. This rate is actually even $o(1/k^2)$ as
proved in [7].


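Algorithm 1: A General Inertial Forward-Backward splitting

Initial: $\bar a \leq 1$, $\bar b \leq 1$, $\epsilon, \bar\epsilon > 0$ such that $\epsilon \leq 2\beta - \bar\epsilon$. $x_0 \in \mathbb{R}^n$, $x_{-1} = x_0$.

repeat

Let $a_k \in [0, \bar a]$, $b_k \in [0, \bar b]$, $\gamma_k \in [\epsilon, 2\beta - \bar\epsilon]$:

$$y_{a,k} = x_k + a_k(x_k - x_{k-1}), \quad y_{b,k} = x_k + b_k(x_k - x_{k-1}), \tag{1.2}$$

$$x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(y_{a,k} - \gamma_k \nabla F(y_{b,k})\big). \tag{1.3}$$

k = k + 1;
until convergence;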

In this paper, we propose a generalized inertial Forward-Backward splitting method (iFB) which by form
covers all the above existing inertial schemes as special cases, see Algorithm 1. More precisely, based on the
choice of the inertial parameters $a_k$ and $b_k$, the proposed method recovers the following special cases:

• $a_k = 0$, $b_k = 0$: this is the original FB method [40];

• $a_k \in [0, \bar a]$, $b_k = 0$: this is the case studied in [44] for $(\mathcal{P}_{\mathrm{inc}})$. In the context of optimization with
$R = 0$, one recovers the heavy ball method with friction in [51];

• $a_k \in [0, \bar a]$, $b_k = a_k$: this corresponds to the work of [41] for solving $(\mathcal{P}_{\mathrm{inc}})$. If moreover we restrict
$\gamma_k \in ]0, \beta]$ and let $a_k \to 1$, then Algorithm 1 specializes to FISTA-type methods [14, 19, 9, 7] developed
for optimization.

When $a_k, b_k$ satisfy $a_k \in [0, \bar a]$, $b_k \in ]0, \bar b]$, $a_k \neq b_k$, Algorithm 1 is new in the literature to the best of our
knowledge.
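The following Python sketch illustrates steps (1.2)-(1.3) with constant parameters and shows how the three special cases above arise from the choice of $(a_k, b_k)$. It is a sketch under stated assumptions, not the authors' implementation: the toy problem ($F(x) = \frac12\sum_i d_i(x_i - y_i)^2$ with $\max_i d_i = 1$, so $\beta = 1$, and $R = \lambda\|\cdot\|_1$), the step $\gamma = 0.5$, the inertial value $0.1$, and the helper names are all chosen for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def soft_threshold(v: Vector, t: float) -> Vector:
    """Proximity operator of t * ||.||_1."""
    return [(1.0 if vi > 0 else -1.0) * max(abs(vi) - t, 0.0) for vi in v]


def inertial_forward_backward(
    grad_F: Callable[[Vector], Vector],
    prox_R: Callable[[Vector, float], Vector],
    x0: Vector,
    gamma: float,
    a: float,
    b: float,
    n_iter: int,
) -> Vector:
    """Steps (1.2)-(1.3) with constant a_k = a, b_k = b, gamma_k = gamma."""
    x_prev = list(x0)  # x_{-1} = x_0
    x = list(x0)
    for _ in range(n_iter):
        diff = [xi - pi for xi, pi in zip(x, x_prev)]
        y_a = [xi + a * di for xi, di in zip(x, diff)]  # (1.2), inertia on the base point
        y_b = [xi + b * di for xi, di in zip(x, diff)]  # (1.2), inertia on the gradient point
        g = grad_F(y_b)
        x_next = prox_R([ya - gamma * gi for ya, gi in zip(y_a, g)], gamma)  # (1.3)
        x_prev, x = x, x_next
    return x


# Toy data, assumed for illustration only.
d = [1.0, 0.5]
y = [3.0, 0.4]
lam = 1.0


def grad_F(x: Vector) -> Vector:
    return [di * (xi - yi) for di, xi, yi in zip(d, x, y)]


def prox_R(v: Vector, gamma: float) -> Vector:
    return soft_threshold(v, gamma * lam)


results: Dict[str, Vector] = {
    "FB (a = b = 0)": inertial_forward_backward(grad_F, prox_R, [0.0, 0.0], 0.5, 0.0, 0.0, 200),
    "a = 0.1, b = 0": inertial_forward_backward(grad_F, prox_R, [0.0, 0.0], 0.5, 0.1, 0.0, 200),
    "a = b = 0.1": inertial_forward_backward(grad_F, prox_R, [0.0, 0.0], 0.5, 0.1, 0.1, 200),
}
```

On this toy problem all three parameter choices reach the same minimizer (2, 0); only the path taken by the iterates differs.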

Remark 1.1.

(i) Though Algorithm 1 is stated for the optimization problem $(\mathcal{P}_{\mathrm{opt}})$, it readily extends to solve the
monotone inclusion problem $(\mathcal{P}_{\mathrm{inc}})$, for which step (1.3) reads

$$x_{k+1} = J_{\gamma_k A}\big(y_{a,k} - \gamma_k B(y_{b,k})\big), \tag{1.4}$$

where $J_{\gamma A} = (\mathrm{Id} + \gamma A)^{-1}$ denotes the resolvent of $\gamma A$.

(ii) Though they share the same form of iteration when $a_k = b_k$, a notable difference between the inertial
schemes and the FISTA method is the range of choice for the step-size $\gamma_k$, which is $[\epsilon, 2\beta - \bar\epsilon]$ for the
inertial methods, while only $]0, \beta]$ can be afforded by FISTA. This may have some impact on the
practical convergence of the algorithm, see Section 5.5 for more details.

For the rest of the paper, we use the terminology FB-type methods for any scheme in the form of
Algorithm 1 such that the sequence $(x_k)_{k \in \mathbb{N}}$ converges. This will encompass the inertial schemes (denoted iFB)
that we propose, the original FB method of course, and the sequence convergent FISTA method [19, 9] that
corresponds to the specific choice of the inertial sequences $a_k = b_k = \frac{k-1}{k+q}$, $q > 2$. It should be noted,
however, that our global convergence analysis to be presented in Section 2 does not cover the case of FISTA,
which requires a specific proof strategy as developed in [19, 9].

1.3 Contributions


The study of (local) linear convergence of FB-type methods in the absence of strong convexity has become 
an active field in recent years, see the related work below for details. In general, most of the existing works
focus mainly on some special cases (e.g. $R = \|\cdot\|_1$ in $(\mathcal{P}_{\mathrm{opt}})$), and the proofs of the results heavily rely on
the specific structure of the function R, which makes them rather difficult to extend to other cases. Therefore, 
it is important to present a unified analysis framework, and possibly with stronger claims. This is one of the 
main motivations of this work. To be more precise, this paper consists of the following contributions. 

A general class of inertial algorithms We present a unified iFB splitting class of algorithms for solving
$(\mathcal{P}_{\mathrm{opt}})$. It can be viewed as a versatile explicit-implicit discretization of a nonlinear second-order dynamical
system with viscous damping, and thus covers existing methods as special cases. We establish global
convergence of the iterates, and also stability to errors.

Finite activity identification Under the additional assumption that the function $R$ is partly smooth at $x^\star \in
\mathrm{Argmin}(\Phi)$ relative to a $C^2$-smooth manifold $\mathcal{M}_{x^\star}$ (see Definition 3.1) and a non-degeneracy condition
at $x^\star$, we show that any FB-type method to solve $(\mathcal{P}_{\mathrm{opt}})$ has the finite time activity identification property.
This means that, after a finite number of iterations, say $K$, the iterates $x_k \to x^\star$ built by the FB-type method
belong to $\mathcal{M}_{x^\star}$ for all $k \geq K$.

Local linear convergence Exploiting this identification property, we then show that the FB-type methods,
locally along the manifold $\mathcal{M}_{x^\star}$, exhibit a linear convergence regime. We characterize this regime and the
corresponding rates precisely depending on the structure of the active manifold $\mathcal{M}_{x^\star}$. For instance, we
provide sharp estimates for the convergence rate. If moreover problem $(\mathcal{P}_{\mathrm{opt}})$ has the structure described in
Section 5.2, where $F$ is quadratic and $R$ is polyhedral, then finite termination can be obtained.

For the sequence convergent FISTA method, we draw two major conclusions: 

• Locally, FISTA can be slower than the FB method (e.g. see Figure 3); 

• We provide an explanation of the local oscillatory behaviour of FISTA (e.g. see Figure 4); 

we describe precisely how these situations occur. This gives an enlightening explanation of the usefulness
of the so-called restart method to locally accelerate the convergence of FISTA used by many authors, for
instance in sparse recovery [27, 48, 26]: the algorithm is restarted after a certain number of iterations (set
more or less empirically), where the inertial sequence $a_k = b_k$ is reset to 0. In our work, we establish exactly
the oscillation period of the FISTA iteration.

Building upon our local linear convergence analysis, we provide some practical acceleration procedures.
Indeed, once finite identification happens, the non-smooth convex problem $(\mathcal{P}_{\mathrm{opt}})$ becomes (locally)
equivalent to a smooth problem in the (possibly non-convex) active manifold $\mathcal{M}_{x^\star}$. In turn, this opens the door
to acceleration, especially to apply higher order methods such as Newton or non-linear conjugate gradient.
Several numerical results are reported that confirm all our theoretical findings.

1.4 Related work

Finite support identification and local linear convergence of FB for solving a special instance of $(\mathcal{P}_{\mathrm{opt}})$ where
$F$ is quadratic and $R$ the $\ell_1$-norm (the so-called LASSO problem), though in an infinite-dimensional setting, is
established in [16] under either a restrictive injectivity assumption, or a non-degeneracy assumption which is
a specialization of ours (see (ND)). A similar result is proved in [28], for $F$ being a smooth convex and locally
$C^2$ function and $R$ the $\ell_1$-norm, under restricted injectivity and non-degeneracy assumptions. The $\ell_1$-norm
is polyhedral, hence a partly smooth function, and is therefore covered by our results. [3] proved local linear
convergence of FB to solve $(\mathcal{P}_{\mathrm{opt}})$ for $F$ satisfying restricted smoothness and strong convexity assumptions,
and $R$ being a so-called convex decomposable regularizer. Again, the latter is a subclass of partly smooth
functions, and their result is thus covered by ours. For example, our framework covers the total variation
(TV) semi-norm and $\ell_\infty$-norm regularizers which are not decomposable. Local linear convergence rate of
FB for nuclear norm regularization is studied in [33] under a local strong convexity assumption. Local linear
convergence of FISTA for the Lasso problem (i.e. $(\mathcal{P}_{\mathrm{opt}})$ for $F$ quadratic and $R$ the $\ell_1$-norm) has been recently
addressed, for instance in [58], and also [34] under some additional constraints on the inertial parameters.
The proposed work is also a deeper and sharper extension of our previous result on FB [39].

In [30, 31, 29], the authors have shown finite identification of active manifolds associated to partly smooth
functions for a few algorithms, namely the (sub)gradient projection method, Newton-like methods, the
proximal point algorithm and the algorithm in [59]. Their work extends that of e.g. [63] on identifiable surfaces
(see references therein for related work of Dunn, and Burke and Moré). The algorithmic framework we
consider encompasses all the aforementioned methods as special cases. Moreover, in all these works, the local
convergence behaviour was not studied.

1.5 Notations

Throughout the paper, $\mathbb{N}$ is the set of non-negative integers and $k \in \mathbb{N}$ is the index. $\mathbb{R}^n$ is the Euclidean space
of dimension $n$, and Id denotes the identity operator on $\mathbb{R}^n$. For a nonempty convex set $\Omega \subset \mathbb{R}^n$, $\mathrm{ri}(\Omega)$ and
$\mathrm{rbd}(\Omega)$ denote its relative interior and relative boundary respectively, $\mathrm{aff}(\Omega)$ is its affine hull, and $\mathrm{par}(\Omega) = \mathbb{R}(\Omega - \Omega)$
is the subspace parallel to it. Denote $\iota_\Omega$ the indicator function of $\Omega$, $\sigma_\Omega$ its support function and $P_\Omega$ the
orthogonal projector onto $\Omega$. For a matrix $M$, $\ker(M)$ is its null-space. For a linear operator $L$ and a
subspace $T$, we denote $L_T = L \circ P_T$, and $L^+$ its Moore-Penrose pseudo-inverse.

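The sub-differential of a function $R \in \Gamma_0(\mathbb{R}^n)$ is the set-valued operator,

$$\partial R : \mathbb{R}^n \rightrightarrows \mathbb{R}^n, \quad x \mapsto \{ g \in \mathbb{R}^n \mid R(y) \geq R(x) + \langle g, y - x \rangle,\ \forall y \in \mathbb{R}^n \}. \tag{1.5}$$

We denote

$$T_x = \mathrm{par}(\partial R(x))^\perp. \tag{1.6}$$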
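As a small worked instance of (1.5)-(1.6), consider the $\ell_1$-norm on $\mathbb{R}^3$. The point $x = (2, 0, -1)$ is chosen here purely for illustration and does not come from the paper.

```latex
% Worked example (illustrative choice of x): R = \|\cdot\|_1 on \mathbb{R}^3, x = (2, 0, -1).
\begin{align*}
\partial R(x) &= \{1\} \times [-1, 1] \times \{-1\}, \\
\mathrm{par}(\partial R(x)) &= \mathbb{R}\,(\partial R(x) - \partial R(x)) = \{0\} \times \mathbb{R} \times \{0\}, \\
T_x &= \mathrm{par}(\partial R(x))^\perp = \mathbb{R} \times \{0\} \times \mathbb{R}
     = \{ h \in \mathbb{R}^3 : h_2 = 0 \}.
\end{align*}
```

In other words, for the $\ell_1$-norm $T_x$ collects the vectors whose support is contained in the support of $x$.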

Paper organization The rest of the paper is organized as follows. Global convergence of the proposed
iFB method is presented in Section 2. Then in Section 3, we introduce the concept of partial smoothness,
and prove the finite activity identification property of the FB-type methods. We then turn to local linear
convergence analysis in Section 4. Some hints about acceleration are provided in Section 4.5, and numerical
results on various popular examples are reported in Section 5.


2 Global convergence of the inertial Forward-Backward 

In this section, we establish global convergence of the iterates provided by Algorithm 1. We will state our
results (Theorems 2.1 and 2.3) for the finite dimensional optimization problem $(\mathcal{P}_{\mathrm{opt}})$. In fact, our global
convergence results can handle the more general monotone inclusion problem $(\mathcal{P}_{\mathrm{inc}})$ in an infinite dimensional
real Hilbert space, where weak convergence of the iterates sequence can be obtained. The proofs given in
Section A are written for this general setting.

2.1 Exact case

Theorem 2.1 (Conditional convergence). Suppose that Algorithm 1 is run with $\bar a \leq 1$, and sequences
$(a_k)_{k \in \mathbb{N}}$, $(b_k)_{k \in \mathbb{N}}$ such that

$$\sum_{k \in \mathbb{N}} \max\{a_k, b_k\} \|x_k - x_{k-1}\|^2 < +\infty. \tag{2.1}$$

Then, there exists $x^\star \in \mathrm{Argmin}(\Phi)$ such that the sequence $(x_k)_{k \in \mathbb{N}}$ of Algorithm 1 converges to $x^\star$.

The proof of Theorem 2.1 is given in Section A.

Remark 2.2. If $\forall k \in \mathbb{N}$, $a_k \geq b_k$, then (2.1) reduces to the even simpler form

$$\sum_{k \in \mathbb{N}} a_k \|x_k - x_{k-1}\|^2 < +\infty. \tag{2.2}$$

Note that this condition is also the one provided in [44, 41] to ensure global convergence.

The terminology “conditional convergence” used in Theorem 2.1 refers to the fact that for the convergence
to occur, the sequences $(a_k)_{k \in \mathbb{N}}$ and $(b_k)_{k \in \mathbb{N}}$ can be chosen depending (conditionally) on $(x_k)_{k \in \mathbb{N}}$ in such
a way that (2.1) holds. This can be enforced easily by a simple online updating rule such as, given $\bar a \in [0, 1]$,
$\bar b \in [0, 1]$,

$$a_k = \min\{\bar a, c_{a,k}\}, \quad b_k = \min\{\bar b, c_{b,k}\}, \tag{2.3}$$

where $c_{a,k}, c_{b,k} > 0$, and $\max\{c_{a,k}, c_{b,k}\}\|x_k - x_{k-1}\|^2$ is summable. For instance, one can choose
$c_{a,k} = \frac{c_a}{k^{1+\delta}\|x_k - x_{k-1}\|^2}$, $c_a > 0$, $\delta > 0$, and similarly for $c_{b,k}$.

One can also devise choices of $(a_k)_{k \in \mathbb{N}}$ and $(b_k)_{k \in \mathbb{N}}$ that are independent of $(x_k)_{k \in \mathbb{N}}$, and still guarantee
global convergence. We dub this unconditional convergence. Theorem 2.3 below generalizes the results in [5,
44, 41].
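For the conditional rule (2.3), a minimal Python sketch of the online update may help. It assumes the reading $c_{a,k} = c_a / (k^{1+\delta}\|x_k - x_{k-1}\|^2)$ with the index starting at $k = 1$; the function name `online_inertial_parameter`, the synthetic sequence and the constants are hypothetical and chosen only to exercise the formula.

```python
from typing import List, Sequence, Tuple


def online_inertial_parameter(
    k: int,
    x_k: Sequence[float],
    x_prev: Sequence[float],
    upper: float,
    c: float,
    delta: float,
) -> float:
    """Return min(upper, c / (k^(1+delta) * ||x_k - x_{k-1}||^2)) for k >= 1."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x_k, x_prev))
    if sq_dist == 0.0:
        # The summand a_k * ||x_k - x_{k-1}||^2 vanishes whatever a_k is.
        return upper
    return min(upper, c / (k ** (1.0 + delta) * sq_dist))


# Synthetic scalar sequence x_j = 1 / (j + 1), assumed for illustration only.
iterates: List[List[float]] = [[1.0 / (j + 1)] for j in range(51)]
c_a, delta, a_bar = 0.5, 0.1, 1.0

# Pairs (a_k * ||x_k - x_{k-1}||^2, c_a / k^(1+delta)) for k = 1, ..., 50.
terms: List[Tuple[float, float]] = []
for k in range(1, 51):
    a_k = online_inertial_parameter(k, iterates[k], iterates[k - 1], a_bar, c_a, delta)
    sq = (iterates[k][0] - iterates[k - 1][0]) ** 2
    terms.append((a_k * sq, c_a / k ** (1.0 + delta)))
```

By construction each term $a_k\|x_k - x_{k-1}\|^2$ is dominated by $c_a / k^{1+\delta}$, which is summable, so the requirement (2.1) holds for the $a_k$ part regardless of how the iterates behave.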


Theorem 2.3 (Unconditional convergence). Let $\gamma_k$, $a_k$ and $b_k$ be as in Algorithm 1. Assume that there exists
a constant $\tau > 0$ such that either of the following holds:

$$\begin{aligned}
(1 + a_k) - \tfrac{\gamma_k}{2\beta}(1 + b_k)^2 \geq \tau &: \quad a_k < \tfrac{\gamma_k}{2\beta} b_k, \\
(1 - 3a_k) - \tfrac{\gamma_k}{2\beta}(1 - b_k)^2 \geq \tau &: \quad b_k \leq a_k \ \text{ or } \ \tfrac{\gamma_k}{2\beta} b_k \leq a_k < b_k.
\end{aligned} \tag{2.4}$$

Then $\sum_{k \in \mathbb{N}} \|x_k - x_{k-1}\|^2 < +\infty$, and there exists $x^\star \in \mathrm{Argmin}(\Phi)$ such that the sequence $(x_k)_{k \in \mathbb{N}}$ of
Algorithm 1 converges to $x^\star$.

See Section A for the proof. Figure 1 shows graphically the conditions in Theorem 2.3. We let $\tau = 0.01$
and two different choices of $\gamma$ are considered. It can be observed that as $\gamma$ becomes bigger, the range of
$a, b$ in (2.4) becomes smaller.
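The shrinking of the admissible region can be checked numerically with a short Python sketch. It rests on an assumption: condition (2.4) is read with the factor $\gamma_k/(2\beta)$ in both inequalities and in the case split, and with non-strict inequalities against $\tau$. The function names and the grid resolution are hypothetical.

```python
from typing import Set, Tuple


def satisfies_condition_2_4(a: float, b: float, gamma: float, beta: float, tau: float) -> bool:
    """Check (2.4) for constant parameters, under the assumed reading described above."""
    r = gamma / (2.0 * beta)
    if a < r * b:
        # First inequality of (2.4).
        return (1.0 + a) - r * (1.0 + b) ** 2 >= tau
    # Second inequality: covers b <= a as well as r * b <= a < b (recall r < 1).
    return (1.0 - 3.0 * a) - r * (1.0 - b) ** 2 >= tau


def admissible_grid(gamma: float, beta: float, tau: float, steps: int = 20) -> Set[Tuple[float, float]]:
    """Grid points (a, b) in [0, 1]^2 that satisfy the condition."""
    grid = [i / steps for i in range(steps + 1)]
    return {
        (a, b)
        for a in grid
        for b in grid
        if satisfies_condition_2_4(a, b, gamma, beta, tau)
    }


# The two step sizes of Figure 1, with beta normalised to 1 and tau = 0.01.
region_gamma_beta = admissible_grid(gamma=1.0, beta=1.0, tau=0.01)
region_gamma_larger = admissible_grid(gamma=1.25, beta=1.0, tau=0.01)
```

Under this reading the plain FB choice $a = b = 0$ is admissible for both step sizes, and every grid point admissible for $\gamma = 1.25\beta$ is also admissible for $\gamma = \beta$, in line with the observation made about Figure 1.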


2.2 Stability to errors

We now discuss the stability of the iFB method to errors. More precisely, we consider the case where $\partial R(x)$
and $\nabla F(x)$ are computed approximately. Toward this goal, we recall a notion which is inspired by the
$\epsilon$-approximate sub-differential in convex analysis.

Figure 1: Sets of allowable $(a, b)$ ensuring the convergence for a given $\gamma$. (a) $\gamma = \beta$; (b) $\gamma = 1.25\beta$. We set
the value of $\tau$ in (2.4) as 0.01. Each color shaded region corresponds to a different condition appearing in
(2.4), i.e. the cyan one corresponds to the first inequality of (2.4), while the magenta and red ones correspond
to the two conditions of the second inequality of (2.4) respectively.


Definition 2.4 ($\epsilon$-enlargement). Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a set-valued maximal monotone operator, $\epsilon \geq 0$.

Then the $\epsilon$-enlargement of $A$ is defined as,

$$A^\epsilon(x) = \{ v \in \mathbb{R}^n : \langle u - v, y - x \rangle \geq -\epsilon,\ \forall y \in \mathbb{R}^n,\ u \in A(y) \}.$$

From the definition, for $0 \leq \epsilon_1 \leq \epsilon_2$ we have $A^{\epsilon_1}(x) \subset A^{\epsilon_2}(x)$ and $A^0(x) = A(x)$. Thus $A^\epsilon$ is an
enlargement of $A$.

Denote $\partial^\epsilon R$ the $\epsilon$-enlargement of $\partial R$. We now consider an inexact form of the iFB algorithm where step
(1.3) is replaced by the corresponding inexact form that consists in finding $x_{k+1}$ such that

$$y_{a,k} - \gamma_k\big(\nabla F(y_{b,k}) + \xi_k\big) - x_{k+1} \in \gamma_k \partial^{\epsilon_k} R(x_{k+1}), \tag{2.5}$$

where $\xi_k \in \mathbb{R}^n$ is the error in the evaluation of the gradient operator $\nabla F$. Observe that since the
$\epsilon$-approximate subdifferential of a proper closed convex function is contained in the $\epsilon$-enlargement of its
sub-differential [17], our setting also handles the case of approximate sub-differentials.

Proposition 2.5. Consider Algorithm 1 with the inexact iteration (2.5). Suppose that the conditions in
Theorem 2.1 hold, and moreover, that one of the following holds,

(i) Ok G]0,d], EfceN^fc < +oo EfceN ^ll^fcll < +oo; 

(ii) Ok = 0, EfceN^fc < < +oo- 

Then the conclusion of Theorem 2.1 holds true. 

See Section A for the proof. This result generalizes that of [44] who considered the case $b_k = 0$ and
$\xi_k = 0$. [10] also studied the inexact sequence convergent FISTA method, i.e. $a_k = b_k = \frac{k-1}{k+q}$, $q > 2$, with
the same errors as ours.

3 Partial smoothness and finite time activity identification


3.1 Partial smoothness

From now on, besides assumption (H.1), we assume that $R$ in $(\mathcal{P}_{\mathrm{opt}})$ is moreover a partly smooth function
relative to a smooth manifold. The notion of partial smoothness was first introduced in [37]. This concept, as
well as that of identifiable surfaces [63], captures the essential features of the geometry of non-smoothness
which are along the so-called active/identifiable manifold. For convex functions, a closely related idea is
developed in [36]. Loosely speaking, a partly smooth function behaves smoothly as we move on the
identifiable submanifold, and sharply if we move normal to the manifold. In fact, the behaviour of the function and
of its minimizers depend essentially on its restriction to this manifold, hence offering a powerful framework
for algorithmic and sensitivity analysis theory.

Let $\mathcal{M}$ be a $C^2$-smooth embedded submanifold of $\mathbb{R}^n$ around a point $x$. To lighten terminology, henceforth
we shall state $C^2$-manifold instead of $C^2$-smooth embedded submanifold of $\mathbb{R}^n$. The natural embedding of
a submanifold $\mathcal{M}$ into $\mathbb{R}^n$ permits to define a Riemannian structure on $\mathcal{M}$, and we simply say $\mathcal{M}$ is a
Riemannian manifold. $T_{\mathcal{M}}(x)$ denotes the tangent space to $\mathcal{M}$ at any point near $x$ in $\mathcal{M}$. More materials on
manifolds are given in Section B.1.

We are now ready to state formally the class of partly smooth functions through its regularity properties.

Definition 3.1 (Partly smooth function). Let $R \in \Gamma_0(\mathbb{R}^n)$. $R$ is said to be partly smooth at $x$ relative to a
set $\mathcal{M}$ containing $x$ if $\partial R(x) \neq \emptyset$, and moreover

(i) Smoothness: $\mathcal{M}$ is a $C^2$-manifold around $x$, and $R$ restricted to $\mathcal{M}$ is $C^2$ around $x$;

(ii) Sharpness: The tangent space $T_{\mathcal{M}}(x)$ coincides with $T_x$ as given in (1.6);

(iii) Continuity: The set-valued mapping $\partial R$ is continuous at $x$ relative to $\mathcal{M}$.

The class of partly smooth functions at $x$ relative to $\mathcal{M}$ is denoted as $\mathrm{PSF}_x(\mathcal{M})$.

One can easily show that a function in $\Gamma_0(\mathbb{R}^n)$ which is locally polyhedral around $x$ is partly smooth at
$x$ relative to $x + T_x$. Polyhedrality also implies that the subdifferential is locally constant around $x$ along
$x + T_x$. Capitalizing on the results of [37], it can be shown that under mild transversality conditions, the set
of proper Isc convex and partly smooth functions is closed under addition and pre-composition by a linear 
operator. Moreover, absolutely permutation-invariant convex and partly smooth functions of the singular 
values of a real matrix, i.e. spectral functions, are convex and partly smooth spectral functions of the matrix 
[23]. Many examples of partly smooth functions that are popular in signal processing, machine learning and 
statistics will be discussed in Section 5.1. 

[37, Proposition 2.10] allows to prove the following fact. 

Fact 3.2 (Local normal sharpness). If $R \in \mathrm{PSF}_x(\mathcal{M})$, then all $x' \in \mathcal{M}$ near $x$ satisfy $T_{\mathcal{M}}(x') = T_{x'}$. In
particular, when $\mathcal{M}$ is affine or linear, then $T_{x'} = T_x$.

We now give expressions of the Riemannian gradient and Hessian (see Section B.1 for definitions) for the
case of partly smooth functions relative to a $C^2$ submanifold. This is summarized in the following fact which
follows by combining (B.2), (B.3), Definition 3.1, Fact 3.2 and [24, Proposition 17] (or [42, Lemma 2.4]).

Fact 3.3. If $R \in \mathrm{PSF}_x(\mathcal{M})$, then for any $x' \in \mathcal{M}$ near $x$

$$\nabla_{\mathcal{M}} R(x') = P_{T_{x'}}\big(\partial R(x')\big),$$

and this does not depend on the smooth representation of $R$ on $\mathcal{M}$. In turn, for all $h \in T_{x'}$

$$\nabla^2_{\mathcal{M}} R(x') h = P_{T_{x'}} \nabla^2 \tilde R(x') h + \mathfrak{W}_{x'}\big(h, P_{T_{x'}^\perp} \nabla \tilde R(x')\big),$$

where $\tilde R$ is a smooth extension (representative) of $R$ on $\mathcal{M}$, and $\mathfrak{W}_x(\cdot, \cdot) : T_x \times T_x^\perp \to T_x$ is the Weingarten
map of $\mathcal{M}$ at $x$ (see Section B.1 for definitions).

3.2 Finite time activity identification

In this section, we state our result establishing that FB-type methods have the finite activity identification 
property. 

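Theorem 3.4 (Finite activity identification). Suppose that the FB-type method is used to create a sequence
$(x_k)_{k \in \mathbb{N}}$ that converges to $x^\star \in \mathrm{Argmin}(\Phi)$ such that $R \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$, $F$ is $C^2$ locally around $x^\star$, and
moreover the non-degeneracy condition

$$-\nabla F(x^\star) \in \mathrm{ri}\big(\partial R(x^\star)\big), \tag{ND}$$

holds. Then, there exists a large enough $K > 0$ such that for all $k \geq K$, $x_k \in \mathcal{M}_{x^\star}$.

If moreover,

(i) $\mathcal{M}_{x^\star}$ is an affine subspace, then $\mathcal{M}_{x^\star} = x^\star + T_{x^\star}$ and $y_{a,k}, y_{b,k} \in \mathcal{M}_{x^\star}$, $\forall k > K$;

(ii) $R$ is locally polyhedral around $x^\star$, then $y_{a,k}, y_{b,k} \in \mathcal{M}_{x^\star} = x^\star + T_{x^\star}$ for all $k > K$,
$\nabla_{\mathcal{M}_{x^\star}} R(x_k) = \nabla_{\mathcal{M}_{x^\star}} R(x^\star)$ and $\nabla^2_{\mathcal{M}_{x^\star}} R(x_k) = 0$, $\forall k \geq K$.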

Remark 3.5.

(i) Recall that the FB-type class of algorithms we consider contains the original FB method, the iFB one that
we propose, and the FISTA method. The iFB is convergent under the assumptions of Theorem 2.1 or
Theorem 2.3. The FISTA method is sequence convergent for $a_k = b_k = \frac{k-1}{k+q}$, $q > 2$, and $\gamma_k \equiv \gamma \in
]0, \beta]$; see [19, 9]. Thus, the finite identification property holds true for all these instances.

(ii) The non-degeneracy condition (ND) can be viewed as a geometric generalization of the strict
complementarity of non-linear programming. Building on the arguments of [31], it is almost a necessary
condition for the finite identification of $\mathcal{M}_{x^\star}$. Relaxing this assumption is a challenging problem in
general.

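Proof. Since $F$ is locally $C^2$ around $x^\star$, the smooth perturbation rule of partly smooth functions [37,
Corollary 4.7] ensures that $\Phi \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$.

By assumption, the sequence $(x_k)_{k \in \mathbb{N}}$ created by the FB-type method converges to $x^\star \in \mathrm{Argmin}(\Phi)$,
and the latter is non-empty by assumption (H.3). Assumptions (H.1)-(H.2) entail that (ND) is equivalent to
$0 \in \mathrm{ri}(\partial \Phi(x^\star))$. Now (1.3) is equivalent to

$$\begin{aligned}
& y_{a,k} - \gamma_k \nabla F(y_{b,k}) - x_{k+1} \in \gamma_k \partial R(x_{k+1}) \\
\iff\ & (a_k - b_k)(x_k - x_{k-1}) + \big(y_{b,k} - \gamma_k \nabla F(y_{b,k})\big) - \big(x_{k+1} - \gamma_k \nabla F(x_{k+1})\big) \in \gamma_k \partial \Phi(x_{k+1}).
\end{aligned}$$

By the Baillon-Haddad theorem [11], $\mathrm{Id} - \gamma_k \nabla F$ is averaged non-expansive for the prescribed range of $\gamma_k$, hence
non-expansive, whence we get

$$\begin{aligned}
\mathrm{dist}\big(0, \partial \Phi(x_{k+1})\big) &\leq \tfrac{1}{\gamma_k} \big\| (a_k - b_k)(x_k - x_{k-1}) + \big(y_{b,k} - \gamma_k \nabla F(y_{b,k})\big) - \big(x_{k+1} - \gamma_k \nabla F(x_{k+1})\big) \big\| \\
&\leq \tfrac{1}{\gamma_k} \Big( |a_k - b_k| \|x_k - x_{k-1}\| + \big\| \big(y_{b,k} - \gamma_k \nabla F(y_{b,k})\big) - \big(x_{k+1} - \gamma_k \nabla F(x_{k+1})\big) \big\| \Big) \\
&\leq \tfrac{1}{\gamma_k} \Big( |a_k - b_k| \|x_k - x_{k-1}\| + \|y_{b,k} - x_{k+1}\| \Big).
\end{aligned}$$

Since $\liminf \gamma_k \geq \epsilon > 0$ and $x_k$ is convergent, we obtain $\mathrm{dist}(0, \partial \Phi(x_{k+1})) \to 0$. Owing to assumptions
(H.1)-(H.2), $\Phi$ is sub-differentially continuous at every point in its domain, and in particular at $x^\star$ for 0,
which in turn entails $\Phi(x_k) \to \Phi(x^\star)$. Altogether, this shows that the conditions of [30, Theorem 5.3] are
fulfilled, and the result follows.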

(i) When the active manifold $\mathcal{M}_{x^\star}$ is an affine subspace, then $\mathcal{M}_{x^\star} = x^\star + T_{x^\star}$ owing to the normal
sharpness property and the claim follows immediately;

(ii) When $R$ is locally polyhedral around $x^\star$, then $\mathcal{M}_{x^\star}$ is an affine subspace and the identification of
$y_{a,k}, y_{b,k}$ follows from (i). For the rest, it is sufficient to observe that by polyhedrality, for any $x \in
\mathcal{M}_{x^\star}$ near $x^\star$, $\partial R(x) = \partial R(x^\star)$. Therefore, combining Fact 3.2 and Fact 3.3, we get the second
conclusion. □
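Finite identification can be observed on a small example. The Python sketch below is an illustration under assumptions that are not part of the paper: a separable quadratic $F(x) = \frac12\sum_i d_i(x_i - y_i)^2$ (so $1/\beta = \max_i d_i = 0.5$), $R = \lambda\|\cdot\|_1$, and plain FB with $\gamma = 1$. For these data $x^\star = (1, 0, 0)$, the active manifold is $\{x : x_2 = x_3 = 0\}$, and (ND) holds since $|d_i y_i| < \lambda$ for $i = 2, 3$. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def soft_threshold(v: Vector, t: float) -> Vector:
    """Proximity operator of t * ||.||_1."""
    return [(1.0 if vi > 0 else -1.0) * max(abs(vi) - t, 0.0) for vi in v]


# Toy data, assumed for illustration only.
d = [0.5, 0.2, 0.5]
y = [3.0, 4.0, 1.5]
lam = 1.0
gamma = 1.0


def fb_step(x: Vector) -> Vector:
    """One Forward-Backward step for F(x) = 0.5 * sum_i d_i (x_i - y_i)^2, R = lam * ||.||_1."""
    forward = [xi - gamma * di * (xi - yi) for xi, di, yi in zip(x, d, y)]
    return soft_threshold(forward, gamma * lam)


def support(x: Vector) -> Tuple[int, ...]:
    """Indices of the non-zero entries of x."""
    return tuple(i for i, xi in enumerate(x) if xi != 0.0)


# Closed-form minimizer of this separable problem.
x_star = [soft_threshold([yi], lam / di)[0] for yi, di in zip(y, d)]

iterates = [[10.0, 10.0, 10.0]]
for _ in range(200):
    iterates.append(fb_step(iterates[-1]))

# Smallest K such that supp(x_k) = supp(x_star) for every computed k >= K.
target = support(x_star)
K = len(iterates)
while K > 0 and support(iterates[K - 1]) == target:
    K -= 1
```

After iteration `K` the inactive coordinates are exactly zero and stay so, while the active coordinate keeps converging linearly towards its limit; this is the two-phase behaviour described by Theorem 3.4 and studied further in Section 4.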


A bound on the identification iteration In Theorem 3.4, we have not provided an estimate of the iteration $K > 0$ beyond
which finite identification occurs. There is of course a situation where the answer is trivial, i.e. $R$ is the
indicator function of an affine subspace. However, knowing $K$ has practical interest, for instance, if one
wants to switch to higher order acceleration (see Section 4.5). It is then legitimate to wonder whether such
an estimate of $K$ can be given. In the following, we shall give a bound in some important cases. For the sake
of simplicity, we state the result for the case of FB (i.e. $a_k = b_k = 0$ in Algorithm 1). A similar reasoning
can be easily generalized to the case of any converging FB-type method.


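Proposition 3.6. Suppose that the assumptions of Theorem 3.4 hold. Then the following holds.

(i) If the iterates are such that $\partial R(x_k) \subset \mathrm{rbd}(\partial R(x^\star))$ whenever $x_k \notin \mathcal{M}_{x^\star}$, then $x_k \in \mathcal{M}_{x^\star}$ for
all

$$k \geq \frac{\|x_0 - x^\star\|^2}{\epsilon^2\, \mathrm{dist}\big(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star))\big)^2}.$$

(ii) If $R$ is separable, i.e. $R(x) = \sum_{i=1}^m \sigma_{C_i}(x_{b_i})$, where $\forall 1 \leq i \leq m$, $b_i \subset \{1, \ldots, n\}$, $\bigcup_{i=1}^m b_i =
\{1, \ldots, n\}$, and $b_i \cap b_j = \emptyset$, $\forall i \neq j$, and $\dim(C_i) = |b_i|$, then identification of $\mathcal{M}_{x^\star}$ occurs for some $k$
larger than

$$\frac{\|x_0 - x^\star\|^2}{\epsilon^2 \sum_{i \in I^c_{x^\star}} \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2},$$

where $I_x = \{i : x_{b_i} \neq 0\}$.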


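Proof. (i) By firm non-expansiveness of $\mathrm{prox}_{\gamma_{k-1} R}$, and non-expansiveness of $\mathrm{Id} - \gamma_{k-1} \nabla F$, we have

$$\begin{aligned}
\|x_k - x^\star\|^2 &\leq \|(\mathrm{Id} - \gamma_{k-1} \nabla F)(x_{k-1}) - (\mathrm{Id} - \gamma_{k-1} \nabla F)(x^\star)\|^2 \\
&\quad - \|x_{k-1} - \gamma_{k-1} \nabla F(x_{k-1}) - x_k + \gamma_{k-1} \nabla F(x^\star)\|^2 \\
&\leq \|x_{k-1} - x^\star\|^2 - \epsilon^2 \|u_k + \nabla F(x^\star)\|^2,
\end{aligned}$$

where we denoted $u_k = (x_{k-1} - x_k)/\gamma_{k-1} - \nabla F(x_{k-1})$. By definition, we have $u_k \in \partial R(x_k)$. Suppose
that identification has not occurred at $k$, i.e. that $x_k \notin \mathcal{M}_{x^\star}$, and hence $u_k \in \partial R(x_k) \subset \mathrm{rbd}(\partial R(x^\star))$.
Therefore, continuing the above inequality, we get

$$\begin{aligned}
\|x_k - x^\star\|^2 &\leq \|x_{k-1} - x^\star\|^2 - \epsilon^2\, \mathrm{dist}\big(-\nabla F(x^\star), \partial R(x_k)\big)^2 \\
&\leq \|x_{k-1} - x^\star\|^2 - \epsilon^2\, \mathrm{dist}\big(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star))\big)^2 \\
&\leq \|x_0 - x^\star\|^2 - k \epsilon^2\, \mathrm{dist}\big(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star))\big)^2,
\end{aligned}$$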

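and $\mathrm{dist}(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star))) > 0$ owing to (ND). Taking $k$ as the largest integer such that the
right hand side is positive, we deduce that the number of iterations where identification has not occurred
does not exceed the given bound, whence our conclusion follows.

(ii) We have $\partial \sigma_{C_i}(x^\star_{b_i}) = C_i$, $\forall i \in I^c_{x^\star}$. In turn, by separability, $R$ is partly smooth at $x^\star$ relative to
$\mathcal{M}_{x^\star} = \times_{i=1}^m \mathcal{M}_{x^\star_{b_i}}$, where $\mathcal{M}_{x^\star_{b_i}} = \{0\}$ if $i \in I^c_{x^\star}$ and $\mathcal{M}_{x^\star_{b_i}} \neq \{0\}$ otherwise. Suppose that at iteration
$k$, $I_{x_k} \cap I^c_{x^\star} \neq \emptyset$. Denote $h_{k-1} = x_{k-1} - \gamma_{k-1} \nabla F(x_{k-1})$, and $h^\star = x^\star - \gamma_{k-1} \nabla F(x^\star)$. Thus for any
$i \in I_{x_k} \cap I^c_{x^\star}$, we have

$$\begin{aligned}
x_{k,b_i} - x^\star_{b_i} &= h_{k-1,b_i} - P_{\gamma_{k-1} C_i}(h_{k-1,b_i}) \\
&= (h_{k-1,b_i} - h^\star_{b_i}) - \big(P_{\gamma_{k-1} C_i}(h_{k-1,b_i}) - P_{\gamma_{k-1} C_i}(h^\star_{b_i})\big),
\end{aligned}$$

where we used Moreau identity in the first equality. Since $i \in I_{x_k} \cap I^c_{x^\star}$, we have $h_{k-1,b_i} \notin \gamma_{k-1} C_i$
and $h^\star_{b_i} \in \gamma_{k-1} C_i$, or equivalently, that $P_{\gamma_{k-1} C_i}(h_{k-1,b_i}) \in \gamma_{k-1} \mathrm{rbd}(C_i) = \gamma_{k-1} \mathrm{rbd}(\partial \sigma_{C_i}(x^\star_{b_i}))$ and
$P_{\gamma_{k-1} C_i}(h^\star_{b_i}) = h^\star_{b_i}$. Combining this with the fact that the orthogonal projector on $\gamma_{k-1} C_i$ is firmly
non-expansive, we obtain

$$\begin{aligned}
\|x_{k,b_i} - x^\star_{b_i}\|^2 &\leq \|h_{k-1,b_i} - h^\star_{b_i}\|^2 - \|P_{\gamma_{k-1} C_i}(h_{k-1,b_i}) - h^\star_{b_i}\|^2 \\
&= \|h_{k-1,b_i} - h^\star_{b_i}\|^2 - \|P_{\gamma_{k-1} C_i}(h_{k-1,b_i}) + \gamma_{k-1} \nabla F(x^\star)_{b_i}\|^2 \\
&\leq \|h_{k-1,b_i} - h^\star_{b_i}\|^2 - \gamma_{k-1}^2\, \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2 \\
&\leq \|h_{k-1,b_i} - h^\star_{b_i}\|^2 - \epsilon^2\, \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2.
\end{aligned}$$

This bound together with non-expansiveness of $\mathrm{prox}_{\gamma_{k-1} \sigma_{C_i}}$ and $\mathrm{Id} - \gamma_{k-1} \nabla F$ yield

$$\begin{aligned}
\|x_k - x^\star\|^2 &= \sum_{i \in I_{x_k} \cap I^c_{x^\star}} \|x_{k,b_i} - x^\star_{b_i}\|^2 + \sum_{j \notin I_{x_k} \cap I^c_{x^\star}} \|x_{k,b_j} - x^\star_{b_j}\|^2 \\
&\leq \|h_{k-1} - h^\star\|^2 - \epsilon^2 \sum_{i \in I_{x_k} \cap I^c_{x^\star}} \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2 \\
&\leq \|x_{k-1} - x^\star\|^2 - \epsilon^2 \sum_{i \in I_{x_k} \cap I^c_{x^\star}} \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2 \\
&\leq \|x_0 - x^\star\|^2 - k \epsilon^2 \sum_{i \in I^c_{x^\star}} \mathrm{dist}\big(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i)\big)^2,
\end{aligned}$$

where the last term in the right hand side is strictly positive by (ND). Taking $k$ as the largest integer such
that the right hand side is positive, we deduce that the number of iterations where $I_{x_k} \cap I^c_{x^\star} \neq \emptyset$ does not
exceed the given bound. We then conclude that beyond this bound, there is no $i$ such that $\mathcal{M}_{x_{k,b_i}} \neq \{0\}$
while $\mathcal{M}_{x^\star_{b_i}} = \{0\}$. The proof is complete.

□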

Note that, as intuitively expected, this bound increases as the non-degeneracy condition (ND) becomes
more stringent. However, as it depends on $x^\star$, it is only of theoretical interest. In the separable case, observe
that $\sum_{i \in I^c_{x^\star}} \mathrm{dist}(-\nabla F(x^\star)_{b_i}, \mathrm{rbd}(C_i))^2 = \mathrm{dist}(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star)))^2$ when $\sigma_{C_i}$ is differentiable at $x^\star_{b_i}$ for
all $i \in I_{x^\star}$. The case of the $\ell_1$-norm considered in [28] is recovered in the second situation of Proposition 3.6
with $C_i = [-\lambda, \lambda]$ for some $\lambda > 0$.
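To see how conservative such a bound can be, the sketch below evaluates the bound of Proposition 3.6(i), read here as $\|x_0 - x^\star\|^2 / (\epsilon^2\, \mathrm{dist}(-\nabla F(x^\star), \mathrm{rbd}(\partial R(x^\star)))^2)$, on a one-dimensional instance. The instance is an assumption made for illustration: $F(x) = \frac12(x - \delta)^2$ with $\delta = 0.5$, $R = |\cdot|$, constant step $\gamma = \epsilon = 0.5$ and $x_0 = 5$, so that $x^\star = 0$, $\mathcal{M}_{x^\star} = \{0\}$ and $\mathrm{rbd}(\partial R(x^\star)) = \{-1, 1\}$.

```python
def soft_threshold_scalar(v: float, t: float) -> float:
    """Proximity operator of t * |.| on the real line."""
    return (1.0 if v > 0 else -1.0) * max(abs(v) - t, 0.0)


# Illustrative data (assumed, not from the paper).
delta = 0.5   # F(x) = 0.5 * (x - delta)^2, so -F'(x_star) = delta at x_star = 0
gamma = 0.5   # constant step size, also playing the role of the lower bound epsilon
x0 = 5.0
x_star = 0.0

# Run plain FB until the iterate lands on the active manifold {0}.
x = x0
first_identified = 0
while x != 0.0 and first_identified < 10_000:
    x = soft_threshold_scalar(x - gamma * (x - delta), gamma)
    first_identified += 1

# Distance from -F'(x_star) = delta to the relative boundary {-1, 1} of [-1, 1].
dist_to_rbd = min(abs(delta - 1.0), abs(delta + 1.0))
bound = (x0 - x_star) ** 2 / (gamma ** 2 * dist_to_rbd ** 2)
```

Here the iterates are 2.25, 0.875, 0.1875 and then 0, so identification happens at iteration 4, whereas the bound evaluates to 400: it is valid but far from tight on this instance.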

3.3 Stability to errors

Consider the inexact version (2.5) with $\epsilon_k = 0$, that is

$$x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(y_{a,k} - \gamma_k(\nabla F(y_{b,k}) + \xi_k)\big).$$

Assume that $(\xi_k)_{k \in \mathbb{N}}$ is such that $(x_k)_{k \in \mathbb{N}}$ converges to some $x^\star \in \mathrm{Argmin}(\Phi)$ (see typically the summability
conditions in Proposition 2.5(i)-(ii)). Then, since $\xi_k \to 0$, it can be easily seen from the proof of Theorem 3.4
that the activity identification property holds true for the above inexact iteration.

However, one cannot afford in general having non-zero errors $\epsilon_k$ in the implicit step as in (2.5), even
summable (see Proposition 2.5). The deep reason behind this is that in the exact case, under condition (ND),
the proximal mappings of $R$ and $R + \iota_{\mathcal{M}_{x^\star}}$ locally agree nearby $x^\star$. This property is clearly violated if
approximate proximal mappings are involved. Here is a simple example.



Figure 2: Graph of $(\mathrm{Id} + \partial_{\varepsilon}|\cdot|)^{-1}$.


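Example 3.7. Let $F : x \in \mathbb{R} \mapsto \frac{1}{2}|\delta - x|^2$, with $\delta \in ]-1,1[$, and $R : x \in \mathbb{R} \mapsto |x|$. It is easy to see that $\Phi \in \Gamma_0(\mathbb{R})$, and it has a unique minimizer $x^\star = \mathrm{prox}_{|\cdot|}(\delta) = \max(1 - 1/|\delta|, 0)\,\delta = 0$. Moreover, $\Phi$ is partly smooth at $x^\star$ relative to $\mathcal{M}_{x^\star} = \{0\}$, and $\delta - x^\star = \delta \in \mathrm{ri}(\partial R(x^\star)) = ]-1,1[$. Consider the inexact version of the FB algorithm

\[ x_{k+1} \in (\mathrm{Id} + \partial_{\varepsilon_k}|\cdot|)^{-1}(\delta), \tag{3.1} \]

where we set $\gamma_k \equiv 1$, since $\nabla F$ is 1-Lipschitz. From [17, Example 5.2.5], we have

\[ \partial_{\varepsilon}|\cdot|(x) = \begin{cases} [1 - \varepsilon/x,\, 1] & \text{if } x > \varepsilon/2, \\ [-1,\, 1] & \text{if } |x| \le \varepsilon/2, \\ [-1,\, -1 - \varepsilon/x] & \text{if } x < -\varepsilon/2, \end{cases} \]

whence the graph of $(\mathrm{Id} + \partial_{\varepsilon}|\cdot|)^{-1}$ can be easily deduced as displayed in Fig. 2. Thus, depending on $\varepsilon_k$ and the choice made in the inclusion (3.1), $x_k$ may never vanish for any finite $k$, i.e. $x_k \notin \mathcal{M}_{x^\star}$ for any finite $k$.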
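The following minimal Python sketch (an illustration, not part of the original paper) encodes the $\varepsilon$-subdifferential of $|\cdot|$ and tests membership in the inclusion (3.1). The function names `eps_subdiff_abs` and `in_inexact_prox`, and the values $\delta = 0.5$, $\varepsilon = 0.1$, are assumptions made only for the example.

```python
def eps_subdiff_abs(x, eps):
    """Return (lo, hi): the eps-subdifferential of |.| at x as a closed interval."""
    if x > eps / 2:
        return (1.0 - eps / x, 1.0)
    if x < -eps / 2:
        return (-1.0, -1.0 - eps / x)
    return (-1.0, 1.0)


def in_inexact_prox(x, delta, eps):
    """True if x is in (Id + eps-subdiff |.|)^{-1}(delta), i.e. delta - x lies in the interval."""
    lo, hi = eps_subdiff_abs(x, eps)
    return lo <= delta - x <= hi


delta = 0.5                                   # hypothetical value in ]-1, 1[
# Exact proximal point of |.| at delta (soft-thresholding at level 1): it is zero.
exact = max(abs(delta) - 1.0, 0.0) * (1.0 if delta >= 0 else -1.0)

# With eps > 0, a non-zero point is also admissible in (3.1) ...
nonzero_admissible = in_inexact_prox(0.02, delta, eps=0.1)
# ... whereas with eps = 0 only the exact point x = 0 is.
nonzero_admissible_exact = in_inexact_prox(0.02, delta, eps=0.0)
```

Under these assumed values, any $x$ with $|x| \le \varepsilon/2$ satisfies the inclusion, which is why an inexact proximal step may return iterates that never land on $\mathcal{M}_{x^\star} = \{0\}$.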


4 Local linear convergence of FB-type methods

We are now in position to present the local linear convergence result for FB-type methods, and all the proofs in this section are collected in Section B. Throughout this section, $x^\star$ is a global minimizer of problem $(\mathcal{P}_{\mathrm{opt}})$ such that the sequence $(x_k)_{k\in\mathbb{N}}$ provided by the FB-type method converges to $x^\star$. $\mathcal{M}_{x^\star}$ is the partial smoothness manifold of $R$ at $x^\star$, and $T_{x^\star}$ the corresponding tangent space.

Restricted injectivity. In addition to the local $C^2$-smoothness assumption of $F$ made in Theorem 3.4, we suppose the following restricted injectivity condition,

\[ \ker(\nabla^2 F(x^\star)) \cap T_{x^\star} = \{0\}. \tag{RI} \]

The local continuity of the Hessian of $F$ then implies that there exist $\alpha > 0$ and $\epsilon > 0$, such that $\forall h \in T_{x^\star}$,

\[ \langle h, \nabla^2 F(x) h \rangle \ge \alpha \|h\|^2, \quad \forall x \in \mathbb{B}_{\epsilon}(x^\star). \tag{4.1} \]

It turns out that under conditions (ND)-(RI), one can show that problem $(\mathcal{P}_{\mathrm{opt}})$ admits a unique minimizer, and local quadratic growth of $\Phi$ if $R$ is moreover partly smooth. Recall that a function $\Phi$ grows quadratically locally around $x^\star$ if $\exists c > 0$ such that $\Phi(x) \ge \Phi(x^\star) + c\|x - x^\star\|^2$, $\forall x$ near $x^\star$.

Proposition 4.1 (Uniqueness of the minimizer). Under assumptions (H.1)-(H.3), let $x^\star \in \mathrm{Argmin}(\Phi)$ be a global minimizer of $(\mathcal{P}_{\mathrm{opt}})$ such that $F$ is $C^2$ locally around $x^\star$. If conditions (ND) and (RI) are also fulfilled, then

(i) $x^\star$ is the unique minimizer of $(\mathcal{P}_{\mathrm{opt}})$.

(ii) If moreover $R \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$, then $\Phi$ has at least a quadratic growth near $x^\star$.

Remark 4.2. In Proposition 4.1, partial smoothness of R at x* is not needed for the uniqueness claim (i). 
However, it brings more structure, hence the local quadratic growth property in (ii). 

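4.1 Locally linearized iteration

Define the following matrices which are all symmetric,

\[ H = \gamma P_{T_{x^\star}} \nabla^2 F(x^\star) P_{T_{x^\star}}, \quad G = \mathrm{Id} - H, \quad U = \gamma \nabla^2_{\mathcal{M}_{x^\star}} \Phi(x^\star) P_{T_{x^\star}} - H, \tag{4.2} \]

where $\nabla^2_{\mathcal{M}_{x^\star}} \Phi$ is the Riemannian Hessian of $\Phi$ on the manifold $\mathcal{M}_{x^\star}$ (see Fact 3.3).

Lemma 4.3. For problem $(\mathcal{P}_{\mathrm{opt}})$, let (H.1)-(H.3) hold and $x^\star \in \mathrm{Argmin}(\Phi)$ such that $R \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$ and $F$ is $C^2$ locally around $x^\star$. Then $U$ is symmetric positive semi-definite under either of the following circumstances:

(i) (ND) holds.

(ii) $\mathcal{M}_{x^\star}$ is an affine subspace.

In turn, $\mathrm{Id} + U$ is invertible, and $W = (\mathrm{Id} + U)^{-1}$ is symmetric positive definite with eigenvalues in $]0,1]$.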
The following simple lemma gathers important properties of the matrices in (4.2).

Lemma 4.4. For the matrices in (4.2) and $W$,

(i) Under (H.2) and (RI),

(a) $H$ is symmetric positive definite with eigenvalues in $[\gamma\alpha, \gamma/\beta]$.

(b) For $\gamma \in [\epsilon, 2\beta - \epsilon]$ and $\epsilon > 0$, $G$ has eigenvalues in $[-1 + \epsilon/\beta, 1 - \alpha\epsilon] \subset ]-1,1[$.

(c) For $\gamma \in [\epsilon, \beta]$, $G$ is also symmetric positive semi-definite with eigenvalues in $[0, 1 - \alpha\epsilon] \subset [0,1[$.

(ii) If both the assumptions of Lemma 4.3 and (i) hold, then $WG$ has real eigenvalues lying in $]-1,1[$. If moreover $\gamma \in [\epsilon, \beta]$, then $WG$ has eigenvalues lying in $[0,1[$.

Let $a \in [0, \bar{a}]$, $b \in [0, \bar{b}]$, $\gamma \in [\epsilon, 2\beta - \epsilon]$, define $r_k \overset{\mathrm{def}}{=} x_k - x^\star$, $d_k \overset{\mathrm{def}}{=} \begin{pmatrix} r_k \\ r_{k-1} \end{pmatrix}$, and the matrix

\[ M = \begin{bmatrix} (a-b)W + (1+b)WG & -(a-b)W - bWG \\ \mathrm{Id} & 0 \end{bmatrix}. \tag{4.3} \]
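As a numerical sanity check of the eigenvalue ranges in Lemma 4.4(i), here is a minimal NumPy sketch. It assumes a least-squares data term $F(x) = \frac{1}{2}\|Ax - y\|^2$ (so $\nabla^2 F = A^\top A$ and $1/\beta = \|A\|^2$) and a coordinate subspace $T$ as for the $\ell_1$-norm; the matrix `A`, the index set `support` and the sizes are hypothetical. $H$ and $G$ are restricted to $T$, where the definiteness statements are meant to hold.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 48, 128, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)   # hypothetical operator
support = np.arange(s)                          # hypothetical identified support, T = span(e_i, i in support)
A_T = A[:, support]                             # restriction of A to T

beta = 1.0 / np.linalg.norm(A, 2) ** 2          # grad F is (1/beta)-Lipschitz
alpha = np.linalg.eigvalsh(A_T.T @ A_T).min()   # restricted injectivity constant in (4.1)
gamma = beta                                    # step-size in ]0, beta]

H_T = gamma * (A_T.T @ A_T)                     # H of (4.2), restricted to T
G_T = np.eye(s) - H_T                           # G of (4.2), restricted to T
eig_G = np.linalg.eigvalsh(G_T)
```

With $\gamma = \beta$, the eigenvalues of `G_T` should fall in $[0, 1 - \alpha\gamma]$, in line with item (c) of the lemma.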


Our interest in the vector $d_k$ is inspired by the convergence rate analysis of the heavy ball method [52, Section 3.2].

We now show that once the active manifold is identified, the FB-type iteration locally linearizes.

Proposition 4.5 (Locally linearized iteration). Let (H.1)-(H.3) hold, and assume that an FB-type method is used to create a sequence $(x_k)_{k\in\mathbb{N}}$ that converges to $x^\star \in \mathrm{Argmin}(\Phi)$ such that (ND) and (RI) hold. If moreover,

\[ a_k \to a \in [0,1], \quad b_k \to b \in [0,1], \quad \gamma_k \to \gamma \in [\epsilon, 2\beta - \epsilon], \tag{4.4} \]

then for $k$ large enough, we have

\[ d_{k+1} = M d_k + o(\|d_k\|). \tag{4.5} \]

The $o(\cdot)$ term disappears when $R$ is locally polyhedral and $(\gamma_k, a_k, b_k)$ are chosen constant.

Remark 4.6.

(i) (4.4) asserts that both the inertial parameters $(a_k, b_k)$ and the step-size $\gamma_k$ should converge to some limit points, and this condition cannot be relaxed in general.

(ii) For the FB method (i.e. $a_k = b_k \equiv 0$), (4.3) can be further simplified, and the corresponding linearized iteration can be stated in terms of $r_k$ directly, which reads

\[ r_{k+1} = WG\, r_k + o(\|r_k\|). \tag{4.6} \]

(iii) Proposition 4.5 also covers the sequence convergent FISTA method [19, 9], i.e. $a_k = b_k = \frac{k-1}{k+q}$ where $q > 2$ is a constant, and $\gamma_k \equiv \gamma \in ]0, \beta]$. In this case, we have indeed $a_k \to a = b = 1$.


4.2 Spectral properties of M 

Our aim now is to establish local linear convergence of FB-type schemes. For this, given the structure of 
the locally linearized iteration (4.5), it is sufficient to strictly upper-bound by 1 the spectral radius of M, and 
conclude using standard arguments. This is what we are about to do. 

The rationale is to start by relating explicitly the eigenvalues of $M$ to those of $G$ or $WG$, and then use Lemma 4.4 to upper-bound the spectral radius of $M$. However, given the structure of $M$, this is a challenging linear algebra problem, and can only be done for some cases: $a$ and $b$ possibly different but the function $R$ is locally polyhedral, or $R$ is a general partly smooth function but $a = b$. These situations are not restrictive at all and cover all interesting applications we have in mind.

Let $\eta$ and $\sigma$ be an eigenvalue of $WG$ and $M$ respectively. We denote by $\underline{\eta}$, $\bar{\eta}$ the smallest and largest (signed) eigenvalues of $WG$, and by $\rho(M)$ the spectral radius of $M$.


Locally polyhedral case. When $R$ is locally polyhedral, $U$ vanishes and $W = \mathrm{Id}$, then $M$ in (4.3) simplifies to the following form

\[ M = \begin{bmatrix} (a-b)\mathrm{Id} + (1+b)G & -(a-b)\mathrm{Id} - bG \\ \mathrm{Id} & 0 \end{bmatrix}. \tag{4.7} \]

Proposition 4.7. If $\begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$ is an eigenvector of $M$ in (4.7) corresponding to an eigenvalue $\sigma$, then it must satisfy $r_1 = \sigma r_2$. Moreover, we have

(i) $r_2$ is an eigenvector of $G$ associated to an eigenvalue $\eta$, where $\eta$ and $\sigma$ satisfy the relation

\[ \sigma^2 - \big((a-b) + (1+b)\eta\big)\sigma + (a-b) + b\eta = 0. \tag{4.8} \]

(ii) Given any $(a,b) \in [0,1[^2$, then $\rho(M) < 1$ if and only if

\[ \frac{2(b-a)-1}{1+2b} < \underline{\eta}. \tag{4.9} \]

Remark 4.8. Though $G$ has $n$ eigenvalues, it can be shown that, given $a$ and $b$, $\rho(M)$ is determined only by $\underline{\eta}$ and $\bar{\eta}$. These extreme eigenvalues lie in $]-1,1[$ ($\gamma \in ]0, 2\beta[$) or even in $[0,1[$ ($\gamma \in ]0, \beta]$) by Lemma 4.4(i)(b)-(c).
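The relation between the spectra of $G$ and $M$ can be checked numerically. The sketch below is only an illustration under assumed values: the diagonal matrix standing in for $G$, the pair $(a, b) = (0.4, 0.3)$ and the helper names `build_M` and `sigma_roots` are not from the paper. It assembles $M$ as in (4.7), compares its eigenvalues with the roots of the quadratic (4.8), and evaluates condition (4.9).

```python
import numpy as np


def build_M(G, a, b):
    """Assemble the matrix M of (4.7) (locally polyhedral case, W = Id)."""
    n = G.shape[0]
    I = np.eye(n)
    top = np.hstack([(a - b) * I + (1 + b) * G, -(a - b) * I - b * G])
    bottom = np.hstack([I, np.zeros((n, n))])
    return np.vstack([top, bottom])


def sigma_roots(eta, a, b):
    """Roots of the quadratic (4.8) for one eigenvalue eta of G."""
    return np.roots([1.0, -((a - b) + (1 + b) * eta), (a - b) + b * eta])


etas = np.array([0.1, 0.5, 0.9])     # hypothetical eigenvalues of G in [0, 1[
a, b = 0.4, 0.3                      # hypothetical inertial parameters
M = build_M(np.diag(etas), a, b)

eig_M = np.linalg.eigvals(M)
predicted = np.concatenate([sigma_roots(eta, a, b) for eta in etas])
rho_M = np.abs(eig_M).max()
condition_49 = (2 * (b - a) - 1) / (1 + 2 * b) < etas.min()
```

Each root in `predicted` should coincide with an eigenvalue of `M`, and `rho_M < 1` should hold whenever `condition_49` is true.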


General partly smooth case. When $R$ is a general partly smooth function, then $U$ is nontrivial, and the spectral analysis of (4.3) becomes a generalized eigenvalue problem which is much more complex. Therefore, we assume $b = a$, in which case $M$ reads

\[ M = \begin{bmatrix} (1+a)WG & -aWG \\ \mathrm{Id} & 0 \end{bmatrix}. \tag{4.10} \]

We have the following corollary of Proposition 4.7.

Corollary 4.9. Let $b = a$. If $\begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$ is an eigenvector of $M$ corresponding to an eigenvalue $\sigma$, then it must satisfy $r_1 = \sigma r_2$. Moreover $r_2$ is an eigenvector of $WG$ related to eigenvalue $\eta$, where $\eta$ and $\sigma$ satisfy the relation

\[ \sigma^2 - (1+a)\eta\sigma + a\eta = 0, \tag{4.11} \]

and $\rho(M) < 1$ if and only if

\[ \frac{-1}{1+2a} < \underline{\eta}. \tag{4.12} \]

Remark 4.10. Condition (4.12) holds naturally for $\gamma \in ]0, \beta]$, since by Lemma 4.4(ii), for such $\gamma$, $\underline{\eta} \ge 0$.


4.3 Local linear convergence of FB-type methods

Now we are able to present the local linear convergence result of FB-type methods, and start with the case where $R$ is locally polyhedral around $x^\star$.

Theorem 4.11. Suppose (H.1)-(H.3) hold, and an FB-type method generates a sequence $x_k \to x^\star \in \mathrm{Argmin}(\Phi)$ such that $R$ is locally polyhedral around $x^\star$, $F$ is $C^2$ near $x^\star$, and conditions (ND), (RI) are satisfied. If moreover (4.4) and (4.9) hold, then $(x_k)_{k\in\mathbb{N}}$ converges locally linearly to $x^\star$. More precisely, given any $\rho \in [\rho(M), 1[$, there exists $K > 0$ and a constant $C > 0$, such that for all $k \ge K$, there holds

\[ \|x_k - x^\star\| \le C \rho^{k-K} \|x_K - x^\star\|. \]

Proof. Combining Proposition 4.5, Proposition 4.7 and [52, Section 2.1.2, Theorem 1], leads to the claimed result. □

Remark 4.12. $\rho(M)$ is the optimal rate. Indeed, when $a_k \equiv a$, $b_k \equiv b$ and $\gamma_k \equiv \gamma$, the $o(\cdot)$ term vanishes in (4.5) and thus, $\rho = \rho(M)$.
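To illustrate the remark, one can iterate the exact linear recursion for a single eigenvalue $\eta$ of $G$ and compare the observed decay with the largest root modulus of (4.8). This is a sketch under assumed values ($\eta = 0.9$, $a = b = 0.3$, $K = 300$ iterations); the function name `observed_rate` is hypothetical.

```python
import numpy as np


def observed_rate(eta, a, b, K=300):
    """Run r_{k+1} = p r_k - q r_{k-1} and return (||d_K|| / ||d_0||)^(1/K)."""
    p = (a - b) + (1 + b) * eta
    q = (a - b) + b * eta
    r_prev, r = 1.0, 1.0                 # arbitrary non-zero start d_0 = (r_0, r_{-1})
    norm0 = np.hypot(r, r_prev)
    for _ in range(K):
        r, r_prev = p * r - q * r_prev, r
    return (np.hypot(r, r_prev) / norm0) ** (1.0 / K)


eta, a, b = 0.9, 0.3, 0.3                # hypothetical values
# Spectral radius of the 2x2 block of M associated with eta, via (4.8)
rho_M = np.abs(np.roots([1.0, -((a - b) + (1 + b) * eta), (a - b) + b * eta])).max()
rate = observed_rate(eta, a, b)
```

For a long enough run, `rate` should approach `rho_M`, up to a constant factor whose $K$-th root tends to 1.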

Let's turn to the case where $R$ is a general partly smooth function, but $b = a \in [0, \bar{a}]$ as in (4.10).

Theorem 4.13. Suppose assumptions (H.1)-(H.3) hold, and the FB-type method generates a sequence $x_k \to x^\star \in \mathrm{Argmin}(\Phi)$ such that $R \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$, $F$ is $C^2$ near $x^\star$, and conditions (ND), (RI) are satisfied. If moreover (4.4) holds with $b = a$, and (4.12) is satisfied, then $(x_k)_{k\in\mathbb{N}}$ converges locally linearly to $x^\star$. More precisely, given any $\rho \in [\rho(M), 1[$, there exists $K > 0$ and a constant $C > 0$, such that for all $k \ge K$, there holds

\[ \|x_k - x^\star\| \le C \rho^{k-K} \|x_K - x^\star\|. \]

Proof. This follows by combining Proposition 4.5, Corollary 4.9 and [52, Section 2.1.2, Theorem 1]. □

Remark 4.14. 

(i) The limit $b = a$ in (4.4) does not mean that we should set $b_k = a_k$, $\forall k \in \mathbb{N}$ along the iterations.

(ii) In contrast to our previous work [39], which addresses the case of the FB method, the rate estimates that we provide here are much sharper in general, and both estimates only coincide when $R$ is locally polyhedral (see the numerical experiments for more details). The main reason underlying this is that, here, our rate estimate relies on the locally linearized iteration in Proposition 4.5 and the spectral properties of $M$, which take into account the geometry of the identified submanifold (its curvature for instance). This is not the case in our former work.

(iii) The obtained results can be readily extended to the variable metric FB splitting method [22], where a 
rate under an appropriate metric can be obtained. However for the sake of brevity, we do not pursue 
this further. 

(iv) In our proof of local linear convergence, convexity does not play a crucial role. For instance, it was only needed to show that the matrix $U$ is positive semi-definite. This suggests that our local linear convergence claims can be extended to the non-convex case, provided that the Riemannian Hessian of $R$ is assumed positive semi-definite at $x^\star$. In addition, to guarantee finite identification in the non-convex setting, we need global convergence of iFB to a critical point, which can be ensured if for instance $\Phi$ satisfies the (non-smooth) Kurdyka-Łojasiewicz inequality [15]. This will be left to a forthcoming paper.

The restricted injectivity condition (RI) plays an important role in our local convergence rate analysis and in general cannot be relaxed. However, for some special cases, such as when $R$ is locally polyhedral, it can be removed, at the price of a less sharp rate estimate. This is formalized in the following statement.

Theorem 4.15. Suppose that (H.1)-(H.3) hold, and an FB-type method creates a sequence $x_k \to x^\star \in \mathrm{Argmin}(\Phi)$ such that $R$ is locally polyhedral around $x^\star$, $F$ is $C^2$ near $x^\star$, and condition (ND) holds. If moreover there exists $\epsilon > 0$ and a subspace $V$ such that

\[ \ker\big(P_{T_{x^\star}} \nabla^2 F(x) P_{T_{x^\star}}\big) = V, \quad \forall x \in \mathbb{B}_{\epsilon}(x^\star) \cap (x^\star + T_{x^\star}), \]

then $(x_k)_{k\in\mathbb{N}}$ converges locally linearly to $x^\star$.

The expressions of the local rate can be found by inspecting the proof.

4.4 Discussion

In this part, we present some discussions on the obtained local linear convergence result, and mainly focus on the difference between the FISTA and the iFB methods.

FB is locally faster than FISTA. For the sake of brevity (the same conclusions hold true in the general case), we consider $b_k = a_k \equiv a \in [0,1]$ and $\gamma_k \equiv \gamma \in ]0, \beta]$ is fixed, in which case $\bar{\eta} \ge \underline{\eta} \ge 0$ (see Lemma 4.4(ii)), and thus condition (4.12) is in force. Moreover $\bar{\eta}$ is also the local convergence rate of the FB method, and $\rho(M)$ depends solely on $\bar{\eta}$ and the value of $a$. Recall that $\rho(M)$ is the best local linear convergence rate (see Theorems 4.11 and 4.13).

Figure 3 shows $\rho(M)$ as a function of $a$ for fixed $\bar{\eta}$. One can make the following observations:

(1) When $a \in [0, \bar{\eta}]$, we have $\rho(M) \le \bar{\eta}$. This entails that if iFB is used with such a choice of inertial parameter, it will converge locally linearly faster than FB. For $a \in ]\bar{\eta}, 1]$, the situation reverses as $\rho(M) > \bar{\eta}$, and iFB becomes slower than FB.

(2) In particular, as $a = 1$ for FISTA, we have $\rho(M) = \sqrt{\bar{\eta}} > \bar{\eta}$. In plain words, though FISTA is known to be globally faster (in terms of the objective) than FB, attaining the optimal $O(1/k^2)$ rate, locally, the situation radically changes as FISTA will always end up being locally slower than FB. A similar observation is made in [58] for the special case of a variant of FISTA used to solve the LASSO problem. This explains in particular why many authors [27, 48] resort to restarting to accelerate local convergence of FISTA, which consists in resetting periodically the scheme to $a = 0$ which is more favorable to FISTA. Our predictions in Figure 3 give clues on when to restart (i.e. detect the point in red on the rate curve). We will elaborate more on this in the numerical simulations in Section 5.5.

(3) $\rho(M)$ attains its minimal value at $a = \frac{(1 - \sqrt{1 - \bar{\eta}})^2}{\bar{\eta}}$, and this is the best convergence rate that can be achieved locally for FB-type methods.
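The three observations can be reproduced numerically from the quadratic (4.11). The sketch below assumes, as in Figure 3, that $\rho(M)$ is governed by $\bar{\eta}$ alone; the value $\bar{\eta} = 0.9$, the grid and the function name `rho_b_equals_a` are illustrative choices, not taken from the paper.

```python
import numpy as np


def rho_b_equals_a(eta, a):
    """Largest root modulus of sigma^2 - (1 + a) * eta * sigma + a * eta = 0, cf. (4.11)."""
    return np.abs(np.roots([1.0, -(1 + a) * eta, a * eta])).max()


eta_bar = 0.9                                    # hypothetical local FB rate
a_grid = np.linspace(0.0, 1.0, 1001)
rho = np.array([rho_b_equals_a(eta_bar, a) for a in a_grid])

a_opt = (1 - np.sqrt(1 - eta_bar)) ** 2 / eta_bar   # minimizer from observation (3)
rho_fb = rho[0]                                      # a = 0: plain FB
rho_fista = rho[-1]                                  # a = 1: FISTA limit
a_best_on_grid = a_grid[rho.argmin()]
```

On this grid, `rho_fb` equals $\bar{\eta}$, `rho_fista` equals $\sqrt{\bar{\eta}}$, and the minimum is reached next to `a_opt`, which matches the shape of the curve described above.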



Figure 3: Let $b = a$, and assume $\underline{\eta}$, $\bar{\eta}$ are known and also close enough such that the spectral radius $\rho(M)$ is only affected by $\bar{\eta}$, then $\rho(M)$ is a function of $a$.


Oscillation of the FISTA method. A typical feature of the FISTA method is that it is not monotone and locally oscillates [13], which makes the local convergence even slower, see Figure 4 or 5 for example, or [58] for a FISTA-variant applied to the LASSO problem. In fact, the iFB scheme shares this property as well when the inertial parameters are large.

Such oscillatory behaviour is due to the fact that, for those inertial parameters, the eigenvalue $\sigma_{\max}$ such that $|\sigma_{\max}| = \rho(M)$ is complex. It can then be seen that the oscillation period of $\|x_k - x^\star\|$ is exactly $\pi/\theta$, where $\theta$ is the argument of $\sigma_{\max}$. For the parameter settings used in Figure 3, i.e. $b = a$ and $\gamma \in ]0, \beta]$, we have

\[ \begin{cases} a \in \Big[0, \frac{(1 - \sqrt{1 - \bar{\eta}})^2}{\bar{\eta}}\Big] : & \sigma_{\max} \text{ is real}, \\ a \in \Big]\frac{(1 - \sqrt{1 - \bar{\eta}})^2}{\bar{\eta}}, 1\Big] : & \sigma_{\max} \text{ is complex}, \end{cases} \]

then as long as $a > \frac{(1 - \sqrt{1 - \bar{\eta}})^2}{\bar{\eta}}$, the iFB method locally oscillates. See Figure 5 for an example.
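A short sketch of how the oscillation period could be computed from the dominant root of (4.11) is given below. It assumes $b = a$ and that a single eigenvalue $\bar{\eta}$ drives $\rho(M)$; the helper name `oscillation_period` and the value $\bar{\eta} = 0.9$ are hypothetical.

```python
import numpy as np


def oscillation_period(eta, a):
    """Return pi / theta when the dominant root of (4.11) is complex, else None."""
    roots = np.roots([1.0, -(1 + a) * eta, a * eta])
    sigma = roots[np.argmax(np.abs(roots))]
    if abs(sigma.imag) < 1e-12:          # real dominant root: no oscillation
        return None
    return np.pi / abs(np.angle(sigma))


eta_bar = 0.9                                        # hypothetical value
a_crit = (1 - np.sqrt(1 - eta_bar)) ** 2 / eta_bar   # threshold between real and complex regimes
period_small_a = oscillation_period(eta_bar, 0.5 * a_crit)   # below the threshold
period_fista = oscillation_period(eta_bar, 1.0)              # a = 1
```

For $a = 1$ the roots are $\bar{\eta} \pm i\sqrt{\bar{\eta} - \bar{\eta}^2}$, so $\cos\theta = \sqrt{\bar{\eta}}$ and the period is $\pi / \arccos\sqrt{\bar{\eta}}$.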

4.5 Acceleration

The finite time activity identification property (Theorem 3.4) implies that the globally convex but non-smooth problem eventually becomes locally $C^2$-smooth, but possibly non-convex, constrained on the activity manifold. This opens the door to acceleration, and even finite termination, exploiting the structure of the objective and that of the identified manifold. There are several ways to achieve this goal as we explain hereafter.


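Optimal first-order method. In this case, the idea is to keep the scheme implemented in Algorithm 1, and to refine the parameters to minimize the local convergence rate established in Section 4. Indeed, as shown in Figure 3 and the discussion that follows, there is a proper choice of the inertial parameters $a$ and $b$ that minimizes $\rho(M)$. More precisely, choose $\gamma \in ]0, \beta]$, then $\bar{\eta} = 1 - \alpha\gamma \ge \underline{\eta} \ge 1 - \gamma/\beta \ge 0$, and $\rho(M)$ depends only on $\bar{\eta}$, $a$ and $b$. Then with fixed $\gamma$ (hence $\bar{\eta}$), $\rho(M)$ attains its minimal value for $a$ and $b$ satisfying

\[ \begin{aligned} b = a : &\quad a = \frac{(1 - \sqrt{1 - \bar{\eta}})^2}{\bar{\eta}} = \frac{1 - \sqrt{\alpha\gamma}}{1 + \sqrt{\alpha\gamma}}, \\ b \neq a : &\quad a = (1 - \sqrt{1 - \bar{\eta}})^2 + b(1 - \bar{\eta}) = (1 - \sqrt{\alpha\gamma})^2 + b\alpha\gamma, \end{aligned} \tag{4.13} \]

and the optimal value $\rho^\star$ of $\rho(M)$ reads

\[ \rho^\star = 1 - \sqrt{1 - \bar{\eta}} = 1 - \sqrt{\alpha\gamma}, \tag{4.14} \]

where the second equality comes from (4.2) and Lemma 4.4. This is a decreasing function of $\gamma$, and $\rho^\star = 1 - \sqrt{\alpha\beta}$ is then the minimal rate attained for $\gamma = \beta$. This rate is in agreement with that of [46, Theorem 2.2.2]. If one can afford $\gamma > \beta$ as in our iFB schemes, owing to the result of [52, Section 3.2.1], the best local linear rate is actually

\[ \rho^\star = \frac{1 - \sqrt{\alpha\beta}}{1 + \sqrt{\alpha\beta}} \quad \text{for} \quad \gamma = \frac{4\beta}{(1 + \sqrt{\alpha\beta})^2}, \quad a = \Big(\frac{1 - \sqrt{\alpha\beta}}{1 + \sqrt{\alpha\beta}}\Big)^2 \quad \text{and} \quad b = 0. \]

This is known to be the optimal rate that matches the lower complexity bounds for first-order methods to solve the class of problems $(\mathcal{P}_{\mathrm{opt}})$ if $F$ were also $\alpha$-strongly convex [46, Theorem 2.1.13]. In comparison, for the FB method (i.e. $a = b = 0$), the optimal rate is $\rho^\star = \bar{\eta} = \frac{1 - \alpha\beta}{1 + \alpha\beta}$, attained for $\gamma = \frac{2\beta}{1 + \alpha\beta}$.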
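The three optimal parameter choices discussed above can be tabulated with a few lines of Python. This is a sketch that only evaluates the closed-form expressions as functions of $\alpha\beta$; the function name `local_rates`, the dictionary layout and the sample values $\alpha = 0.01$, $\beta = 1$ are assumptions for illustration.

```python
import math


def local_rates(alpha, beta):
    """Closed-form optimal local rates for FB, iFB (b = a, gamma = beta) and heavy ball (b = 0)."""
    kappa = alpha * beta                      # lies in ]0, 1]
    sq = math.sqrt(kappa)
    return {
        "FB": {"gamma": 2 * beta / (1 + kappa), "a": 0.0,
               "rho": (1 - kappa) / (1 + kappa)},
        "iFB": {"gamma": beta, "a": (1 - sq) / (1 + sq),
                "rho": 1 - sq},
        "heavy_ball": {"gamma": 4 * beta / (1 + sq) ** 2,
                       "a": ((1 - sq) / (1 + sq)) ** 2,
                       "rho": (1 - sq) / (1 + sq)},
    }


rates = local_rates(alpha=0.01, beta=1.0)     # hypothetical constants
for name, params in rates.items():
    print(f"{name:>10s}: gamma = {params['gamma']:.4f}, a = {params['a']:.4f}, rho = {params['rho']:.4f}")
```

For the iFB entry, plugging $\bar{\eta} = 1 - \alpha\gamma$ and the tabulated $a$ into (4.11) gives a zero discriminant, i.e. a double root equal to $1 - \sqrt{\alpha\gamma}$, which is consistent with (4.14).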


Finite convergence in the polyhedral case. Finite termination can be obtained if $R$ is locally polyhedral around $x^\star$, and $F$ is quadratic, i.e. problem $(\mathcal{P}_{\lambda})$ with $R$ locally polyhedral around $x^\star$. In this situation, under hypothesis (ND), we have finite identification of $x^\star + T_{x^\star}$. In addition, (RI) is equivalent to injectivity of the linear operator $L$ on $T_{x^\star}$. Altogether, this allows to show that $x^\star$ can be written explicitly as

{dR{xK)) - LP.,+t.^(0),

for $K$ sufficiently large. When the manifold is linear, i.e. $x^\star \in T_{x^\star}$, the last term vanishes and the above relation can be implemented in practice.


High-order acceleration: Newton method. Once the activity manifold has been identified, one can switch to Newton-type methods for locally minimizing $\Phi$. This can be done either using local parameterizations obtained from $\mathcal{U}$-Lagrangian theory or from Riemannian geometry [36, 42, 56]. One can also use the Riemannian version of the non-linear conjugate gradient method [56]. For these schemes, one can also show respectively quadratic and superlinear convergence since $\nabla^2_{\mathcal{M}_{x^\star}} \Phi(x^\star)$ is positive definite by Proposition 4.1(ii).

5 Numerical experiments 

In this section, we illustrate the obtained results by some popular examples drawn from linear inverse problems in signal processing and machine learning (including sparse recovery). We first start by discussing a few examples of partly smooth functions that are widely used in those applications.

5.1 Examples of partly smooth functions

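Example 5.1 ($\ell_1$-norm). For $x \in \mathbb{R}^n$, the $\ell_1$-norm is defined as

\[ R(x) = \|x\|_1 = \sum_{i=1}^{n} |x_i|, \]

which is polyhedral, hence partly smooth at any $x$ relative to the subspace

\[ \mathcal{M} = T_x = \{u \in \mathbb{R}^n : \mathrm{supp}(u) \subseteq \mathrm{supp}(x)\}, \quad \mathrm{supp}(x) = \{i : x_i \neq 0\}. \]

Its Riemannian gradient at $x$ is $\mathrm{sign}(x_i)$ for $i \in \mathrm{supp}(x)$, and 0 otherwise. Its Riemannian Hessian vanishes.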
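The objects of this example translate directly into code. The sketch below is illustrative only: the helper names `soft_threshold` and `proj_tangent_l1` and the test vector are assumptions. It shows the proximity operator of the $\ell_1$-norm, the projection onto $T_x$ and the Riemannian gradient.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximity operator of t * ||.||_1 (component-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def proj_tangent_l1(u, x):
    """Orthogonal projection of u onto T_x = {u : supp(u) is contained in supp(x)}."""
    return np.where(x != 0, u, 0.0)


z = np.array([2.0, -0.3, 0.0, -1.5])        # hypothetical input
x = soft_threshold(z, 0.5)                   # entries below the threshold are set to zero
support = np.flatnonzero(x)                  # supp(x)
riem_grad = np.sign(x)                       # sign(x_i) on supp(x), 0 elsewhere
u_T = proj_tangent_l1(np.ones(4), x)         # keeps only the coordinates in supp(x)
```

Since the manifold is a subspace and the norm is linear on it near $x$, there is no curvature term to compute: the Riemannian Hessian is the zero matrix.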

Example 5.2 ($\ell_{1,2}$-norm). Let the index set $\{1, \ldots, n\}$ be partitioned into non-overlapping blocks $\mathcal{B}$ such that $\bigcup_{b \in \mathcal{B}} b = \{1, \ldots, n\}$. The $\ell_{1,2}$-norm of $x$ is given by

\[ R(x) = \|x\|_{1,2} = \sum_{b \in \mathcal{B}} \|x_b\|, \]

where $x_b = (x_i)_{i \in b} \in \mathbb{R}^{|b|}$. Though this function is not polyhedral, it is easy to see that it is partly smooth at $x$ relative to the subspace

\[ \mathcal{M} = T_x = \{u \in \mathbb{R}^n : \mathrm{supp}_{\mathcal{B}}(u) \subseteq \mathcal{S}_{\mathcal{B}}\}, \quad \mathcal{S}_{\mathcal{B}} = \bigcup \{b : x_b \neq 0\}. \]

It is straightforward to show that



where.

Example 5.3 (Total Variation). If $R_0 \in \mathrm{PSF}_{D^* x}(\mathcal{M}_0)$, then, under a mild transversality condition, it is shown in [37, Theorem 4.2] that $R = R_0 \circ D^* \in \mathrm{PSF}_x(\mathcal{M})$ where $\mathcal{M} = \{u \in \mathbb{R}^n : D^* u \in \mathcal{M}_0\}$. Popular examples include the anisotropic total variation (TV) semi-norm in which case $R_0 = \|\cdot\|_1$ and $D^* = D_{\mathrm{DIF}}$ is a finite difference approximation of the derivative [55]. For TV, $R$ is then polyhedral, hence partly smooth at $x$ relative to

\[ \mathcal{M} = T_x = \{u \in \mathbb{R}^n : \mathrm{supp}(D^* u) \subseteq \mathrm{supp}(D^* x)\}. \]

Its Riemannian gradient reads $P_{T_x} D\,\mathrm{sign}(D^* x)$ and its Riemannian Hessian vanishes.

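Example 5.4 ($\ell_\infty$-norm). For $x \in \mathbb{R}^n$, the anti-sparsity promoting $\ell_\infty$-norm is defined as follows

\[ R(x) = \|x\|_\infty = \max_{1 \le i \le n} |x_i|. \]

It can be verified that $R$ is a polyhedral norm, hence partly smooth at $x$ relative to

\[ \mathcal{M} = T_x = \{u \in \mathbb{R}^n : u_{I(x)} \in \mathbb{R}\, s_{I(x)}\}, \quad I(x) = \{i : |x_i| = \|x\|_\infty\}, \quad s_i \overset{\mathrm{def}}{=} \begin{cases} \mathrm{sign}(x_i), & \text{if } i \in I(x), \\ 0, & \text{otherwise.} \end{cases} \]

The Riemannian gradient of $\|\cdot\|_\infty$ at $x$ is $s/|I(x)|$, and its Riemannian Hessian vanishes.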

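Example 5.5 (Nuclear norm). For $x \in \mathbb{R}^{n_1 \times n_2}$ with $\mathrm{rank}(x) = r$, let $x = U \mathrm{diag}(\sigma(x)) V^*$ be a reduced rank-$r$ SVD decomposition, where $U \in \mathbb{R}^{n_1 \times r}$ and $V \in \mathbb{R}^{n_2 \times r}$ have orthonormal columns, and $\sigma(x) \in (\mathbb{R}_+ \setminus \{0\})^r$ is the vector of singular values $(\sigma_1(x), \ldots, \sigma_r(x))$ in non-increasing order. Low-rank is the spectral extension of vector sparsity to matrix-valued data $x \in \mathbb{R}^{n_1 \times n_2}$, i.e. imposing sparsity on the singular values of $x$. The nuclear norm is thus defined as

\[ R(x) = \|x\|_* = \|\sigma(x)\|_1. \]

Piecing together [23, Theorem 3.19] and Example 5.1, the nuclear norm can be shown to be partly smooth at $x$ relative to the set of fixed-rank matrices

\[ \mathcal{M} = \{z \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(z) = r\}, \]

which is a $C^2$-manifold around $x$ of dimension $(n_1 + n_2 - r)r$, see [35, Example 8.14]. Moreover, we have

\[ T_x = \{UA^* + BV^* : A \in \mathbb{R}^{n_2 \times r}, B \in \mathbb{R}^{n_1 \times r}\} \quad \text{and} \quad \nabla_{\mathcal{M}} \|x\|_* = UV^*. \]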
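The tangent space and gradient above can be checked numerically on a small real matrix. The sketch relies on the standard formula for the orthogonal projector onto the tangent space of the fixed-rank manifold, $P_{T_x}(h) = UU^\top h + hVV^\top - UU^\top hVV^\top$, which is not stated in the text and is used here as an assumption; the sizes and the helper name `proj_tangent` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 5, 4, 2                                              # hypothetical sizes
x = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # a rank-r matrix
U, s, Vt = np.linalg.svd(x, full_matrices=False)
U, V = U[:, :r], Vt[:r].T                                        # reduced rank-r SVD factors


def proj_tangent(h):
    """Orthogonal projection onto T_x = {U A^T + B V^T} (standard formula, assumed here)."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ h + h @ PV - PU @ h @ PV


grad = U @ V.T                                                   # Riemannian gradient of the nuclear norm at x

# Matrix of the projector acting on vectorized inputs; its trace is dim(T_x).
basis = np.eye(n1 * n2)
P = np.column_stack([proj_tangent(e.reshape(n1, n2)).ravel() for e in basis])
dim_T = np.trace(P)
```

The trace of the projector should equal $(n_1 + n_2 - r)r$, the manifold dimension quoted above, and $UV^*$ should be left unchanged by the projection.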

From [60, Example 21], one can show that for $h \in T_x$,


where 

IkIL = lk(2;)|li = 

is a $C^2$-smooth (and even convex) representation of the nuclear norm on $\mathcal{M}$ near $x$, obtained owing to the smooth transfer principle [23, Corollary 2.3]. The expression of the (Euclidean) Hessian $\nabla^2 \|x\|_*$ can be obtained in several ways, see [60, Example 21] for details.

5.2 Linear inverse problems

In this part, we apply our results to the setting of linear inverse problems. Consider the following forward observation of a vector $x_{ob} \in \mathbb{R}^n$

\[ y = L x_{ob} + w, \tag{5.1} \]

where $y \in \mathbb{R}^m$ is the observation, $L : \mathbb{R}^n \to \mathbb{R}^m$ is some linear operator, and $w \in \mathbb{R}^m$ stands for noise.

Solving such linear inverse problems can be cast as the optimization problem 

where $\lambda > 0$ is the regularization parameter and $R \in \Gamma_0(\mathbb{R}^n)$ encodes prior knowledge on $x_{ob}$, hence promotes objects similar to it. Moreover, when there is no noise in the observation (5.1), namely $w = 0$, the following equality constrained problem should be considered

\[ \min_{x \in \mathbb{R}^n} R(x) \quad \text{s.t.} \quad Lx = L x_{ob}. \tag{$\mathcal{P}_0$} \]

The following result is a straightforward generalization of [62, Theorem 1] to any FB-type method, using 
Theorem 3.4 and Theorem 4.11 (or 4.13). 

Proposition 5.6. Assume that $R \in \mathrm{PSF}_{x_{ob}}(\mathcal{M}_{x_{ob}})$, and the conditions

ker(L) n = {O} and M:,^^R{xoh) G ri(ai?(a:ob)), (5.2) 

hold. If moreover $w$ is sufficiently small and $\lambda$ is chosen in the order of $\|w\|$, then $(\mathcal{P}_{\lambda})$ admits a unique solution $x^\star$ with $\mathcal{M}_{x^\star} = \mathcal{M}_{x_{ob}}$. FB-type methods will identify $\mathcal{M}_{x^\star}$ in finite time, and then converge locally linearly.

This proposition implies that under the given conditions, the minimizer of $(\mathcal{P}_{\lambda})$ lies in the same manifold as the feasible point of $(\mathcal{P}_0)$. It is now sufficient to infer when (5.2) is satisfied for the above proposition to hold true. For instance, when $L$ is a random Gaussian measurement matrix, nice and easily verifiable conditions can be stated for the examples introduced in Section 5.1 above.

Proposition 5.7. Choose L from the standard Gaussian ensemble, i.e. the entries of L are independent 
copies of a mean-zero and standard Gaussian random variable. Then (5.2) is in force with high probability 
in the following cases: 

(i) $R = \|\cdot\|_1$: let $s = \|x_{ob}\|_0$, if $m > 2cs\log(n) + s$ for some $c > 1$;

(ii) $R = \|\cdot\|_{1,2}$: let $s$ be the number of non-zero blocks, if $m > (1 + c)s\big(\sqrt{n/N_{\mathcal{B}}} + \sqrt{2\log(N_{\mathcal{B}})}\big)^2 + sn/N_{\mathcal{B}}$ where $c > 1$, and $N_{\mathcal{B}}$ is the total number of blocks;

(iii) $R = \|\cdot\|_\infty$: let $I(x) = \{i : |(x_{ob})_i| = \|x_{ob}\|_\infty\}$ and $s = |I(x)|$, if $m > n - s + 2cs\log(s/2)$, where $c > 1$;

(iv) $R = \|\cdot\|_*$: let $r = \mathrm{rank}(x_{ob})$, $x_{ob} \in \mathbb{R}^{n_1 \times n_2}$, if $m > cr(3n_1 + 3n_2 - 5r)$ for some $c > 1$.

Proof. This follows from [18, Section 3] for (i), (ii) and (iv), and (iii) from [61, Theorem?]. □

5.3 Experiments setup

Recovery from random measurements. We consider solving $(\mathcal{P}_{\lambda})$ with $R$ being the $\ell_1$, $\ell_{1,2}$, $\ell_\infty$-norms, TV semi-norm and nuclear norm. The observations are generated according to (5.1). Here $L$ is generated from the standard Gaussian ensemble and the following parameters:

$\ell_1$-norm: $(m, n) = (48, 128)$, $\|x_{ob}\|_0 = 8$;
$\ell_{1,2}$-norm: $(m, n) = (60, 128)$, $x_{ob}$ has 3 non-zero blocks of size 4;
$\ell_\infty$-norm: $(m, n) = (123, 128)$, $|I(x_{ob})| = 10$;
Total Variation: $(m, n) = (48, 128)$, $\|D_{\mathrm{DIF}} x_{ob}\|_0 = 8$ where $D_{\mathrm{DIF}}$ is the finite difference operator;
Nuclear norm: $(m, n) = (1425, 2500)$, $x_{ob} \in \mathbb{R}^{50 \times 50}$, $\mathrm{rank}(x_{ob}) = 5$.

It can be noticed that the number of measurements $m$ is chosen sufficiently large such that Proposition 5.7 allows to assert that (ND) and (RI) are verified at $x_{ob}$. We also choose $\|w\|$ small enough and $\lambda$ in the order of $\|w\|$ so that Proposition 5.6 applies.
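As a rough illustration of the $\ell_1$ setting listed above, the following sketch generates a Gaussian $L$ of size $48 \times 128$, an 8-sparse $x_{ob}$, and runs plain FB while recording the support of the iterates. It assumes the objective $\frac{1}{2}\|y - Lx\|^2 + \lambda\|x\|_1$; the noise level, the value `lam = 0.1`, the iteration count and the function names are hypothetical and are not the settings used for the figures.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximity operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def forward_backward_lasso(L, y, lam, n_iter):
    """Plain FB (a_k = b_k = 0) with gamma = beta on 0.5 * ||y - L x||^2 + lam * ||x||_1."""
    beta = 1.0 / np.linalg.norm(L, 2) ** 2          # grad F is (1/beta)-Lipschitz
    gamma = beta
    x = np.zeros(L.shape[1])
    supports, objective = [], []
    for _ in range(n_iter):
        x = soft_threshold(x - gamma * L.T @ (L @ x - y), gamma * lam)
        supports.append(frozenset(np.flatnonzero(x)))
        objective.append(0.5 * np.sum((L @ x - y) ** 2) + lam * np.abs(x).sum())
    # First iteration index from which the support stays equal to the final one (within this run)
    k_id = n_iter - 1
    while k_id > 0 and supports[k_id - 1] == supports[-1]:
        k_id -= 1
    return x, k_id, np.array(objective)


rng = np.random.default_rng(0)
m, n, s = 48, 128, 8
L = rng.standard_normal((m, n))
x_ob = np.zeros(n)
x_ob[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = L @ x_ob + 1e-2 * rng.standard_normal(m)         # small noise w (assumed level)

x_hat, k_id, obj = forward_backward_lasso(L, y, lam=0.1, n_iter=2000)
```

The index `k_id` is only an empirical proxy for the identification iteration: it reports when the support stopped changing within the finite run, not a certified identification time.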


TV deconvolution. We also consider a 2D image processing problem, where $y$ is a degraded image generated according to (5.1), where $L$ is a circular convolution matrix with a Gaussian kernel. The (anisotropic) TV regularizer (see Example 5.3), which is polyhedral, is used.

Note however that for a sparse deconvolution problem through $\ell_1$-minimization, Proposition 5.7 does not apply, hence entailing that exact recovery of the support of $x_{ob}$ in general is impossible, see [25]. However, under the same conditions on $x_{ob}$ and $\lambda$ as in Proposition 5.7, $x^\star$ has a support slightly larger than that of $x_{ob}$, and moreover, $x^\star$ satisfies both (ND) and (RI). See [25, Corollary 1].


5.4 Comparison of the FB-type methods

Parameter settings. For all the methods in comparison (FB, iFB and FISTA), we fix $\gamma_k \equiv \beta$. For the sequence convergent FISTA method [19, 9], two different choices of $q$ are considered, which are 2 and 50. For the iFB method, we let $b_k = a_k$, and use the following rule to update $a_k$. Let $t_0 = 1$, $p \in ]0, +\infty[$, then

\[ t_k = \frac{1 + \sqrt{1 + p\, t_{k-1}^2}}{2}, \quad a_k = \frac{t_{k-1} - 1}{t_k}, \quad \text{and} \quad \begin{cases} p \in ]0, 4[ : & t_k \to \frac{4}{4 - p}, \; a_k \to \frac{p}{4}, \\ p \in [4, +\infty[ : & t_k \to +\infty, \; a_k \to \frac{2}{\sqrt{p}}. \end{cases} \tag{5.3} \]

In this test we choose $p = 4(\sqrt{5} - 2 - 10^{\ldots})$ so that Theorem 2.3 applies. Note that in the original FISTA paper [14], (5.3) is also used but with $p = 4$ fixed.
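The limits stated in (5.3) are easy to check by running the recursion. The sketch below is illustrative: the function name `inertial_sequence`, the iteration count and the sample values of $p$ are assumptions.

```python
import math


def inertial_sequence(p, n_iter):
    """Return [a_1, ..., a_n] generated by the rule (5.3) with t_0 = 1."""
    t_prev, a = 1.0, []
    for _ in range(n_iter):
        t = (1.0 + math.sqrt(1.0 + p * t_prev ** 2)) / 2.0
        a.append((t_prev - 1.0) / t)
        t_prev = t
    return a


a_p2 = inertial_sequence(2.0, 100)    # p in ]0, 4[: a_k should tend to p / 4
a_p4 = inertial_sequence(4.0, 100)    # p = 4 (original FISTA rule): a_k tends to 1, slowly
a_p16 = inertial_sequence(16.0, 100)  # p > 4: a_k should tend to 2 / sqrt(p)
```

The fixed point of the $t_k$ recursion for $p < 4$ solves $(2t - 1)^2 = 1 + pt^2$, i.e. $t = 4/(4 - p)$, which gives the limit $a = 1 - 1/t = p/4$.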

The convergence profiles of $\|x_k - x^\star\|$ are shown in Figure 4. As demonstrated by all the plots, identification and local linear convergence occur after finite time. The solid lines (denoted as "P") represent the observed profiles, while dashed ones (denoted as "T") stand for the theoretically predicted ones. The positions of the cyan points (or the starting points of the dashed lines) stand for the iteration at which $\mathcal{M}_{x^\star}$ has been identified.


Tightness of predicted rates. For the $\ell_1$, $\ell_\infty$-norms and TV semi-norm, our predicted rates coincide exactly with the observed ones (same slopes for the dashed and solid lines). This is due to the fact that they are all polyhedral and $F$ is quadratic. Note that for FISTA, which is non-monotone, the prediction coincides with the envelope of the oscillations. For the $\ell_{1,2}$-norm, though it is not polyhedral, our predicted rates still are very tight, due to the fact that the Riemannian Hessian is taken into account. Then for the nuclear norm,

(a) $\ell_1$-norm  (b) $\ell_{1,2}$-norm  (c) $\ell_\infty$-norm  (d) TV semi-norm  (e) Nuclear norm  (f) Anisotropic TV deconvolution

Figure 4: Local linear convergence and comparison of the FB-type methods (FB, iFB and FISTA) in terms of $\|x_k - x^\star\|$. We fix $\gamma_k \equiv \beta$ for all the methods, moreover, for the iFB method, we let $b_k = a_k \equiv \sqrt{5} - 2 - 10^{\ldots}$, and for the FISTA method, $q = 2, 50$ are considered. For each figure, "P" stands for practical observed profiles, while "T" indicates theoretical predictions. The cyan points indicate the iteration at which $\mathcal{M}_{x^\star}$ has been identified.


whose active manifold is not anymore a subspace, our estimation becomes slightly less sharp compared to the other examples, though barely visible on the plots. For both the $\ell_{1,2}$-norm and nuclear norm, since the Riemannian Hessian is taken into account, the predicted rates are much sharper than our previous estimates for the FB method in [39].

For the image deconvolution problem, assumptions (ND) and (RI) are checked a posteriori (verified for this experiment). This together with the fact that the anisotropic TV is polyhedral justifies that the predicted rate is again exact (up to machine precision).

Comparison of the methods. From the numerical results, we can draw the following remarks:

(i) Overall, FISTA with $q = 50$ (black line) is the fastest while $q = 2$ (gray line) is the slowest. FB and iFB are sandwiched between them with iFB being the faster.

(ii) For the finite activity identification, however, FISTA with $q = 2$ in general shows the fastest identification (see the starting points of the dashed lines), and FB is the slowest.

(iii) Locally, similar to the global convergence, FISTA with $q = 50$ has the fastest rate and $q = 2$ is the slowest. Again, FB and iFB are between them with iFB being faster than FB.

It can be concluded from the above remarks that, in practice, the FISTA method with $q = 2$ is not a wise choice if high accuracy solutions are needed. Indeed, under this choice, $a_k$ converges to 1 too fast, and this hampers its local behaviour, as anticipated in the discussion of Section 4.4 (see Figure 3). In fact, such behaviour of $a_k$ can be avoided by choosing a relatively bigger $q$, and this is exactly what the difference between $q = 2$ and $q = 50$ implies. In our tests, $q \in [50, 100]$ seems to be a good trade-off; an even bigger $q$ is not recommended since it may lead to a much slower activity identification. A similar observation is also mentioned in [19], where the authors only tried $q = 2, 3, 4$. It should be noted that the original FISTA method [14] has almost the same behaviour as the case $q = 2$.
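The effect of $q$ on how fast $a_k$ approaches 1 can be quantified with a short sketch. It assumes the inertial rule $a_k = (k-1)/(k+q)$ of the sequence-convergent FISTA scheme of [19]; this form, the level 0.9 and the helper names are assumptions made for illustration.

```python
def fista_inertial(k, q):
    """Inertial parameter a_k = (k - 1) / (k + q) (assumed form of the rule in [19])."""
    return (k - 1) / (k + q)


def iterations_to_reach(level, q):
    """Smallest k such that a_k >= level, found by direct search."""
    k = 1
    while fista_inertial(k, q) < level:
        k += 1
    return k


k_q2 = iterations_to_reach(0.9, q=2)     # small q: a_k gets close to 1 early
k_q50 = iterations_to_reach(0.9, q=50)   # large q: a_k stays moderate for much longer
a100_q2 = fista_inertial(100, 2)
a100_q50 = fista_inertial(100, 50)
```

Under this rule, a larger $q$ keeps $a_k$ in the moderate range for many more iterations, where the discussion of Section 4.4 predicts a better local rate than for $a$ close to 1.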

It should be pointed out that the local rate of FISTA with $q = 50$ being faster than FB does not contradict our claim in Section 4.4 that FB is faster than FISTA locally. The reason is that we are limited by machine accuracy, and a bigger value of $q$ delays the speed at which $a_k$ approaches 1, which actually makes FISTA behave similarly to the iFB method.

High-order acceleration. For the $\ell_1$, $\ell_\infty$-norms and TV semi-norm, since they are polyhedral, finite termination can be obtained once the manifold is identified. For the $\ell_{1,2}$-norm which is not polyhedral, we applied the Riemannian Newton method which converges quadratically, leading to a dramatic acceleration as can be seen in Figure 4(b). For the nuclear norm, a non-linear conjugate gradient method is applied, leading again to a much faster (super-linear) local convergence.

Oscillation of the FISTA method. As observed from Figure 4, the FISTA method oscillates for both choices of $q$. No oscillation appears for the iFB method since the value of the inertial parameter is not big enough. In order to have a better visualisation of the oscillation of iFB/FISTA methods, we choose the LASSO problem for illustration, set $b = a$ and locally adjust the value of $a$ so that the oscillation period is an integer. The result is shown in Figure 5, where the oscillation period of the tested example is 20.



Figure 5: Local oscillation of the iFB/FISTA methods on the LASSO problem, where the oscillation period is 20.


5.5 Comparisons of step-size and inertial settings

In this section, we provide more comparisons of the iFB method, on the choices of different step-size $\gamma_k$ and also the difference between the inertial parameters $a_k$, $b_k$ respectively.

Comparison of $\gamma_k \equiv \beta$ vs $\gamma_k \equiv 1.5\beta$. We compare the difference between different step-sizes, and two choices of $\gamma_k$ are considered: $\gamma_k \equiv \beta$ and $\gamma_k \equiv 1.5\beta$, and the corresponding inertial parameters are:

• iFB $\gamma_k \equiv \beta$: $a_k = \sqrt{5} - 2 - 10^{\ldots}$, same as above tests such that Theorem 2.3 applies;

• iFB $\gamma_k \equiv 1.5\beta$: $a_k$, $b_k$ are chosen according to (2.3) such that Theorem 2.1 applies.

For comparison, the FISTA method with $q = 2$ and $q = 50$ is also included. Two numerical experiments, on the
$\ell_1$-norm and the nuclear norm, are illustrated in Figure 6. From the numerical results, we can draw the following
observations.

• For FB, a larger $\gamma_k$ leads to faster global convergence and activity identification. However, this does
not mean that bigger is always better locally. As discussed in Section 4.5, the best choice to get the
optimal local linear rate is $2\beta/(1 + \alpha\beta)$.

• iFB is faster than FB under the same choice of $\gamma_k$. FISTA ($q = 50$) is no longer the fastest one, as
it is outperformed by iFB with $\gamma_k = 1.5\beta$ on the LASSO problem. Moreover, it should be noted that the
inertial parameters of iFB can be optimized according to (4.13), which can make the iteration even
faster.
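The step-size $2\beta/(1+\alpha\beta)$ mentioned in the first observation can be checked on a simple model. The sketch below assumes that the local FB rate behaves like gradient descent on a quadratic whose Hessian eigenvalues lie in $[\alpha, 1/\beta]$, so that the rate is $\max(|1-\gamma\alpha|, |1-\gamma/\beta|)$; this model and the values of `alpha` and `beta` are illustrative assumptions, not taken from the experiments.

```python
def local_rate(gamma, alpha, beta):
    """Worst-case contraction factor of a gradient step with step-size gamma
    when the Hessian eigenvalues lie in [alpha, 1/beta] (assumed model)."""
    return max(abs(1 - gamma * alpha), abs(1 - gamma / beta))

alpha, beta = 0.1, 0.5   # hypothetical local strong convexity and cocoercivity constants
gamma_opt = 2 * beta / (1 + alpha * beta)

# Brute-force search over step-sizes in ]0, 2*beta[
grid = [2 * beta * i / 10000 for i in range(1, 10000)]
gamma_best = min(grid, key=lambda g: local_rate(g, alpha, beta))
rate_opt = local_rate(gamma_opt, alpha, beta)
```

Under this model the grid minimizer agrees with `gamma_opt` up to the grid resolution, and the optimal rate equals $(1-\alpha\beta)/(1+\alpha\beta)$.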

Together with the high-order acceleration results presented above, we can conclude that, in practice, a
hybrid strategy combining inertia with a high-order method is an ideal choice for solving $(\mathcal{P}_{\mathrm{opt}})$.




Figure 6: Comparison of the iFB method with different step-sizes. (a) $\ell_1$-norm; (b) nuclear norm.


Comparison of $a_k$ vs $b_k$ Now let us assess the influence of the choice of the inertial parameters; as above, the $\ell_1$-norm
and the nuclear norm are considered. The step-size is fixed as $\gamma_k = \beta$.

For the iFB method, the online updated rule (2.3) is applied, with Cak = Ch^k = -jT 2 > and 4 

different combinations of $(a, b)$ are considered, which are

$(0.3,\ 0.2 \text{ or } 0.6)$ and $(0.8,\ 0.2 \text{ or } 0.6)$.

For both examples, if we let $b_k = 0$, then the optimal local choice $a_{\mathrm{opt}}$ obtained through (4.13) is between
0.3 and 0.8. The obtained plots are depicted in Figure 7, from which we summarize the following observations:

(i) The time to activity identification depends more on the value of $a$. Clearly, relatively bigger values
of $a$ lead to faster identification. On the other hand, when $a < a_{\mathrm{opt}}$ (case $a = 0.3$), bigger values of
$b$ lead to slower identification, while the opposite situation occurs when $a > a_{\mathrm{opt}}$ (case $a = 0.8$).

(ii) The convergence rate also depends more on the choice of $a$: with fixed $a$, the rate difference
caused by different values of $b$ is small, see the blue dashed/solid lines, and the magenta ones.

Figure 7: Comparison of the inertial parameters $a_k$ and $b_k$; the step-size is fixed
as $\gamma_k = \beta$. (a) $\ell_1$-norm; (b) nuclear norm.

6 Discussion and conclusion 

In this paper, we proposed a generalized inertial Forward-Backward splitting scheme which covers several
existing methods as special cases, and presented the corresponding global convergence analysis. Under
partial smoothness, we established that this class of schemes identifies the active manifold in finite time, and
then converges locally linearly. The predicted rates were shown to be very sharp. We verified our theoretical
findings with concrete numerical examples from signal/image processing and machine learning.

Most of our results can be extended to the non-convex setting by introducing appropriate supplementary
assumptions, such as prox-regularity and the nonsmooth Kurdyka-Łojasiewicz inequality. This will be
treated in a future work.

A Proofs of Section 2

Throughout this section, $\mathcal{H}$ denotes a real Hilbert space. We give a proof in the most general setting,
i.e. solving $(\mathcal{P}_{\mathrm{inc}})$ on $\mathcal{H}$. We denote by $\to$ strong convergence and by $\rightharpoonup$ weak convergence on $\mathcal{H}$. We first briefly
introduce some preliminaries which are needed for the convergence proof. Let $A : \mathcal{H} \rightrightarrows \mathcal{H}$ be a set-valued
operator. The graph of $A$ is the set $\mathrm{gph}\,A = \{(x, y) \in \mathcal{H} \times \mathcal{H} \mid y \in A(x)\}$, and its set of zeros is
$\mathrm{zer}\,A = \{x \in \mathcal{H} \mid 0 \in A(x)\}$.

A set-valued operator $A : \mathcal{H} \rightrightarrows \mathcal{H}$ is monotone if

$$(\forall (x, v) \in \mathrm{gph}\,A),\ (\forall (y, u) \in \mathrm{gph}\,A), \quad \langle x - y,\ v - u \rangle \ge 0. \tag{A.1}$$

It is moreover maximal monotone if $\mathrm{gph}\,A$ cannot be contained in the graph of any other monotone operator.

Let $\beta \in ]0, +\infty[$ and $B : \mathcal{H} \to \mathcal{H}$; then $B$ is $\beta$-cocoercive if the following holds

$$(\forall x, y \in \mathcal{H}), \quad \beta \|Bx - By\|^2 \le \langle Bx - By,\ x - y \rangle, \tag{A.2}$$

which indicates that $B$ is $\beta^{-1}$-Lipschitz continuous.
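To make (A.2) concrete, here is a minimal numerical sketch. It assumes a hypothetical operator $B(x) = Dx$ with a positive diagonal matrix $D$ (the gradient of a separable quadratic), for which $\beta = 1/\max_i d_i$; the diagonal entries are illustrative and not taken from the paper.

```python
import math
import random

D = (0.5, 2.0, 4.0)          # hypothetical diagonal entries, all positive
beta = 1.0 / max(D)          # cocoercivity constant of x -> D x

def B(x):
    """Gradient of F(x) = 0.5 * sum(d_i * x_i^2)."""
    return [d * xi for d, xi in zip(D, x)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in D]
    y = [rng.uniform(-5, 5) for _ in D]
    g, h = sub(B(x), B(y)), sub(x, y)
    # (A.2): beta * ||Bx - By||^2 <= <Bx - By, x - y>
    if beta * dot(g, g) > dot(g, h) + 1e-12:
        violations += 1
    # consequence: B is (1/beta)-Lipschitz
    if math.sqrt(dot(g, g)) > math.sqrt(dot(h, h)) / beta + 1e-12:
        violations += 1
```

No violation is expected for this operator, since $\beta d_i \le 1$ for every coordinate.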

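Proof of Theorem 2.1. Let $x^\star \in \mathrm{zer}(A + B)$, i.e. a solution of $(\mathcal{P}_{\mathrm{inc}})$, which exists thanks to (H.6). From
(1.4), we get

$$\begin{aligned} -B(x^\star) &\in A(x^\star), \\ y_{a,k} - x_{k+1} - \gamma_k B(y_{b,k}) &\in \gamma_k A(x_{k+1}). \end{aligned} \tag{A.3}$$

Define the following quantities

$$\varphi_k = \tfrac{1}{2}\|x_k - x^\star\|^2, \quad E_{x,k} = \tfrac{1}{2}\|x_k - x_{k-1}\|^2, \quad E_{a,k+1} = \tfrac{1}{2}\|y_{a,k} - x_{k+1}\|^2, \quad E_{b,k+1} = \tfrac{1}{2}\|y_{b,k} - x_{k+1}\|^2. \tag{A.4}$$

By definition of $y_{a,k}$ we have

$$\begin{aligned} \varphi_k - \varphi_{k+1} &= \tfrac{1}{2}\langle x_k - x^\star,\ x_k - x^\star\rangle - \tfrac{1}{2}\langle x_{k+1} - x^\star,\ x_{k+1} - x^\star\rangle \\ &= \tfrac{1}{2}\langle x_k - x_{k+1} - x^\star + 2x_{k+1} - x^\star,\ x_k - x_{k+1}\rangle \\ &= E_{x,k+1} + \langle x_k - y_{a,k} + y_{a,k} - x_{k+1},\ x_{k+1} - x^\star\rangle \\ &= E_{x,k+1} + \langle y_{a,k} - x_{k+1},\ x_{k+1} - x^\star\rangle - a_k\langle x_k - x_{k-1},\ x_{k+1} - x^\star\rangle. \end{aligned} \tag{A.5}$$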


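Meanwhile, by virtue of the monotonicity of $A$ and (A.3), we have

$$\begin{aligned} &\langle \gamma_k u_{k+1} - \gamma_k u^\star,\ x_{k+1} - x^\star\rangle \ge 0, \quad \forall u_{k+1} \in A(x_{k+1}),\ u^\star \in A(x^\star), \\ &\langle (y_{a,k} - x_{k+1}) - \gamma_k B(y_{b,k}) + \gamma_k B(x^\star),\ x_{k+1} - x^\star\rangle \ge 0, \end{aligned}$$

which leads to

$$\langle y_{a,k} - x_{k+1},\ x_{k+1} - x^\star\rangle \ge \gamma_k\langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - x^\star\rangle.$$

Combining this with (A.5), we obtain

$$\varphi_k - \varphi_{k+1} \ge E_{x,k+1} + \gamma_k\langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - x^\star\rangle - a_k\langle x_k - x_{k-1},\ x_{k+1} - x^\star\rangle. \tag{A.6}$$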

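For $\langle x_k - x_{k-1},\ x_{k+1} - x^\star\rangle$, we have

$$\begin{aligned} \langle x_k - x_{k-1},\ x_{k+1} - x^\star\rangle &= \langle x_k - x_{k-1},\ x_{k+1} - x_k + x_k - x^\star\rangle \\ &= \langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + \langle x_k - x_{k-1},\ x_k - x^\star\rangle \\ &= \langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + (E_{x,k} + \varphi_k - \varphi_{k-1}), \end{aligned} \tag{A.7}$$

where we applied the usual Pythagoras relation to $\langle x_k - x_{k-1},\ x_k - x^\star\rangle$,

$$2\langle c_1 - c_2,\ c_1 - c_3\rangle = \|c_1 - c_2\|^2 + \|c_1 - c_3\|^2 - \|c_2 - c_3\|^2.$$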

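Putting (A.7) back into (A.6) yields

$$\begin{aligned} &\varphi_{k+1} - \varphi_k - a_k(\varphi_k - \varphi_{k-1}) \\ &\quad \le -E_{x,k+1} - \gamma_k\langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - x^\star\rangle + a_k\langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + a_k E_{x,k}. \end{aligned} \tag{A.8}$$

Since $B$ is $\beta$-cocoercive, then

$$\begin{aligned} \langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - x^\star\rangle &= \langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - y_{b,k} + y_{b,k} - x^\star\rangle \\ &\ge \beta\|B(y_{b,k}) - B(x^\star)\|^2 + \langle B(y_{b,k}) - B(x^\star),\ x_{k+1} - y_{b,k}\rangle \\ &\ge \beta\|B(y_{b,k}) - B(x^\star)\|^2 - \beta\|B(y_{b,k}) - B(x^\star)\|^2 - \tfrac{1}{2\beta}E_{b,k+1} \\ &= -\tfrac{1}{2\beta}E_{b,k+1}. \end{aligned} \tag{A.9}$$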

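Denote $\mu_k = 1 - \frac{\gamma_k}{2\beta} \in [\frac{\epsilon}{2\beta}, 1 - \frac{\epsilon}{2\beta}]$, $\nu_k = a_k - \frac{\gamma_k b_k}{2\beta}$ and $v_k = x_{k+1} - x_k - \frac{\nu_k}{\mu_k}(x_k - x_{k-1})$. Substituting
(A.9) back into (A.8), and since $E_{b,k+1} = E_{x,k+1} + b_k^2 E_{x,k} + b_k\langle x_k - x_{k+1},\ x_k - x_{k-1}\rangle$, we get

$$\begin{aligned} &\varphi_{k+1} - \varphi_k - a_k(\varphi_k - \varphi_{k-1}) \\ &\le -E_{x,k+1} + \tfrac{\gamma_k}{2\beta}E_{b,k+1} + a_k\langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + a_k E_{x,k} \\ &= -\mu_k E_{x,k+1} + \nu_k\langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + \big(a_k + \tfrac{\gamma_k b_k^2}{2\beta}\big)E_{x,k} \\ &= -\tfrac{\mu_k}{2}\|x_{k+1} - x_k\|^2 + \nu_k\langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + \big(a_k + \tfrac{\gamma_k b_k^2}{2\beta}\big)E_{x,k} \\ &= \Big(-\tfrac{\mu_k}{2}\big\|x_{k+1} - x_k - \tfrac{\nu_k}{\mu_k}(x_k - x_{k-1})\big\|^2 + \tfrac{\nu_k^2}{\mu_k}E_{x,k}\Big) + \big(a_k + \tfrac{\gamma_k b_k^2}{2\beta}\big)E_{x,k} \end{aligned}$$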


(A. 10) 


k'k 

,2 


+ («* + ^ + < -f lk*f + (^ + ^)E 


l-^k 

2 


yk W 


-'x.k 


< _ Eh. 


2 II-'CM + + (1 “ ^)^fc)^ 3 :i,fc- 

Denote 9k = y)k~ ‘Pk-i 6k = (^Ofc + (1 “ 7^)9k)Ex,k- We then arrive at the following key estimate 


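$$\theta_{k+1} \le -\tfrac{\mu_k}{2}\|v_k\|^2 + a_k\theta_k + \delta_k. \tag{A.11}$$

If $a_k \in ]0, \bar{a}]$, (A.11) yields

$$\theta_{k+1} \le -\tfrac{\mu_k}{2}\|v_k\|^2 + a_k\theta_k + \delta_k \le a_k\theta_k + \delta_k \le \bar{a}[\theta_k]_+ + \delta_k,$$

where $[\theta]_+ = \max\{\theta, 0\}$. As a result, we have

$$[\theta_{k+1}]_+ \le \bar{a}[\theta_k]_+ + \delta_k. \tag{A.12}$$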


Assumption (2.1) is equivalent to the fact that $\delta_k$ is summable. Therefore, using that $\bar{a} < 1$ and applying [21,
Lemma 3.1(iv)], it follows that $[\theta_k]_+$ is summable. Therefore,

$$\varphi_{k+1} - \sum_{j=1}^{k+1}[\theta_j]_+ \le \varphi_{k+1} - \theta_{k+1} - \sum_{j=1}^{k}[\theta_j]_+ = \varphi_k - \sum_{j=1}^{k}[\theta_j]_+.$$

It follows that the sequence $(\varphi_k - \sum_{j=1}^{k}[\theta_j]_+)_{k\in\mathbb{N}}$ is decreasing and bounded below, hence convergent,
whence we deduce that $\varphi_k$ is also convergent.
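The summability step behind (A.12) can be illustrated numerically. The sketch below takes the worst case of a recursion $u_{k+1} \le \bar{a}u_k + \delta_k$ (equality) with a hypothetical summable sequence $\delta_k = 1/k^2$ and a hypothetical $\bar{a} = 0.7$, and compares the partial sum of $u_k$ with the elementary bound $(u_1 + \sum_k \delta_k)/(1 - \bar{a})$ obtained by summing the recursion.

```python
a_bar = 0.7                      # hypothetical bound on the inertial parameter, < 1
K = 5000
delta = [1.0 / (k * k) for k in range(1, K + 1)]   # hypothetical summable sequence

u = [1.0]                        # u[k] plays the role of [theta_k]_+
for k in range(K - 1):
    # worst case of the recursion: equality instead of inequality
    u.append(a_bar * u[k] + delta[k])

total_u = sum(u)
bound = (u[0] + sum(delta)) / (1.0 - a_bar)
```

The partial sums stay below `bound` for every horizon `K`, which is the summability used in the argument above.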

If $a_k = 0$, (A.10) entails

$$\varphi_{k+1} \le \varphi_k + \delta_k.$$

We then conclude that the sequence $(x_k)_{k\in\mathbb{N}}$ is quasi-Fejér monotone (of type III) relative to $\mathrm{zer}(A + B)$
[21, Definition 1.1(3)], and thus $\varphi_k$ is convergent [21, Proposition 3.6].

In summary, for $a_k \in [0, \bar{a}]$, $\lim_{k\to\infty}\|x_k - x^\star\|$ exists for any $x^\star \in \mathrm{zer}(A + B)$.

By assumption (2.1), $a_k(x_k - x_{k-1}) \to 0$ and $b_k(x_k - x_{k-1}) \to 0$, and thus

$$\tfrac{\nu_k}{\mu_k}(x_k - x_{k-1}) \to 0, \tag{A.13}$$

since $\mu_k \ge \frac{\epsilon}{2\beta} > 0$. Moreover, from (A.12), we obtain


— g {o-fo + ^ +00. 

Consequently, $v_k \to 0$. Combining this with (A.13), we get that $x_{k+1} - x_k \to 0$. In turn, $y_{a,k} - x_{k+1} \to 0$
and $y_{b,k} - x_{k+1} \to 0$.

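Let $\bar{x}$ be a weak cluster point of $(x_k)_{k\in\mathbb{N}}$, and let us fix a subsequence, say $x_{k_j} \rightharpoonup \bar{x}$. We get from (1.4)
that

$$u_{k_j} \overset{\mathrm{def}}{=} \frac{y_{a,k_j} - x_{k_j+1}}{\gamma_{k_j}} - B(y_{b,k_j}) \in A(x_{k_j+1}).$$

Since $B$ is cocoercive and $y_{b,k_j} \rightharpoonup \bar{x}$, we have $B(y_{b,k_j}) \to B(\bar{x})$. In turn, $u_{k_j} \to -B(\bar{x})$ since $\gamma_k \ge \epsilon > 0$.
Since $(x_{k_j+1}, u_{k_j}) \in \mathrm{gph}\,A$, and the graph of the maximal monotone operator $A$ is sequentially weakly-strongly
closed in $\mathcal{H} \times \mathcal{H}$, we get that $-B(\bar{x}) \in A(\bar{x})$, i.e. $\bar{x}$ is a solution of $(\mathcal{P}_{\mathrm{inc}})$. Opial's Theorem [49]
then concludes the proof. □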

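Proof of Theorem 2.3. From (A.10), we apply Young's inequality to get

$$\begin{aligned} &\varphi_{k+1} - \varphi_k - a_k(\varphi_k - \varphi_{k-1}) \\ &\le \big(\tfrac{\gamma_k}{2\beta} - 1\big)E_{x,k+1} + \big(a_k - \tfrac{\gamma_k b_k}{2\beta}\big)\langle x_k - x_{k-1},\ x_{k+1} - x_k\rangle + \big(\tfrac{\gamma_k b_k^2}{2\beta} + a_k\big)E_{x,k} \\ &\le \big(\tfrac{\gamma_k}{2\beta} - 1\big)E_{x,k+1} + \big|a_k - \tfrac{\gamma_k b_k}{2\beta}\big|\big(\tfrac{1}{2}\|x_{k+1} - x_k\|^2 + \tfrac{1}{2}\|x_k - x_{k-1}\|^2\big) + \big(\tfrac{\gamma_k b_k^2}{2\beta} + a_k\big)E_{x,k} \\ &= \big(\tfrac{\gamma_k}{2\beta} - 1 + \big|a_k - \tfrac{\gamma_k b_k}{2\beta}\big|\big)E_{x,k+1} + \big(\tfrac{\gamma_k b_k^2}{2\beta} + a_k + \big|a_k - \tfrac{\gamma_k b_k}{2\beta}\big|\big)E_{x,k} \\ &= S_k E_{x,k+1} + T_k E_{x,k}, \end{aligned}$$

where $S_k = \frac{\gamma_k}{2\beta} - 1 + |a_k - \frac{\gamma_k b_k}{2\beta}|$ and $T_k = \frac{\gamma_k b_k^2}{2\beta} + a_k + |a_k - \frac{\gamma_k b_k}{2\beta}|$. Suppose $a_k$, $b_k$ and $\gamma_k$ are non-decreasing
such that $S_k$, $T_k$ are also non-decreasing. Define $\phi_k = \varphi_k - a_k\varphi_{k-1} + T_k E_{x,k}$, then

$$\begin{aligned} \phi_{k+1} - \phi_k &= (\varphi_{k+1} - a_{k+1}\varphi_k + T_{k+1}E_{x,k+1}) - (\varphi_k - a_k\varphi_{k-1} + T_k E_{x,k}) \\ &\le \big(\varphi_{k+1} - \varphi_k - a_k(\varphi_k - \varphi_{k-1})\big) + T_{k+1}E_{x,k+1} - T_k E_{x,k} \\ &\le S_k E_{x,k+1} + T_k E_{x,k} + T_{k+1}E_{x,k+1} - T_k E_{x,k} = (S_k + T_{k+1})E_{x,k+1}. \end{aligned} \tag{A.14}$$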

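Case 1) $a_k \in [0, \bar{a}]$, $b_k \in [0, \bar{b}]$, $b_k \le a_k$. We have $\frac{\gamma_k}{2\beta}b_k \le a_k$, then from (A.14), and under the second
condition in (2.4),

$$\phi_{k+1} - \phi_k \le (S_{k+1} + T_{k+1})E_{x,k+1} = \Big((3a_{k+1} - 1) + \tfrac{\gamma_{k+1}}{2\beta}(1 - b_{k+1})^2\Big)E_{x,k+1} \le -\tau E_{x,k+1}. \tag{A.15}$$

Case 2) $a_k \in [0, \bar{a}]$, $b_k \in ]0, \bar{b}]$, $a_k < b_k$. Since $S_k$ are non-decreasing, then from (A.14) we have,

$$\phi_{k+1} - \phi_k \le \Big(\tfrac{\gamma_{k+1}}{2\beta} - 1 + \big|a_{k+1} - \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}\big| + \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}^2 + a_{k+1} + \big|a_{k+1} - \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}\big|\Big)E_{x,k+1}.$$

Next we discuss the relationship between $a_{k+1}$ and $\frac{\gamma_{k+1}}{2\beta}b_{k+1}$, which splits into two subcases.

(i) If $\frac{\gamma_{k+1}}{2\beta}b_{k+1} \le a_{k+1}$, $k \in \mathbb{N}$, then from the second condition in (2.4),

$$\begin{aligned} \phi_{k+1} - \phi_k &\le \Big(\tfrac{\gamma_{k+1}}{2\beta} - 1 + \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}^2 + a_{k+1} + 2\big(a_{k+1} - \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}\big)\Big)E_{x,k+1} \\ &= \Big((3a_{k+1} - 1) + \tfrac{\gamma_{k+1}}{2\beta}(1 - b_{k+1})^2\Big)E_{x,k+1} \le -\tau E_{x,k+1}. \end{aligned} \tag{A.16}$$

(ii) If $a_{k+1} \le \frac{\gamma_{k+1}}{2\beta}b_{k+1}$, $k \in \mathbb{N}$, then from the first condition of (2.4), we have

$$\begin{aligned} \phi_{k+1} - \phi_k &\le \Big(\tfrac{\gamma_{k+1}}{2\beta} - 1 + \tfrac{\gamma_{k+1}}{2\beta}b_{k+1}^2 + a_{k+1} + 2\big(\tfrac{\gamma_{k+1}}{2\beta}b_{k+1} - a_{k+1}\big)\Big)E_{x,k+1} \\ &= \Big(-(1 + a_{k+1}) + \tfrac{\gamma_{k+1}}{2\beta}(1 + b_{k+1})^2\Big)E_{x,k+1} \le -\tau E_{x,k+1}. \end{aligned} \tag{A.17}$$

From (A.15) (respectively (A.16) or (A.17)), $\phi_k$ is non-increasing. Therefore, we have

$$\varphi_k - \bar{a}\varphi_{k-1} \le \phi_k \le \phi_1 \implies \varphi_k \le \bar{a}^k\varphi_0 + \frac{\phi_1}{1 - \bar{a}}.$$

In the meanwhile, from (A.15) we have

$$\sum_{j=1}^{k} E_{x,j+1} \le \tfrac{1}{\tau}(\phi_1 - \phi_{k+1}) \le \tfrac{1}{\tau}(\phi_1 + \bar{a}\varphi_k) \le \tfrac{1}{\tau}\Big(\bar{a}^{k+1}\varphi_0 + \frac{\phi_1}{1 - \bar{a}}\Big) < +\infty,$$

which means that the summability condition in (2.2) is satisfied. The rest of the proof follows the same
arguments as those in the last part of the proof of Theorem 2.1. □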

Denote by $A^{\epsilon}$ the $\epsilon$-enlargement of $A$.

Proof of Proposition 2.5. Let $x^\star \in \mathrm{zer}(A + B)$. Recall from (2.5) that

Va^k 'yk4,k ^k-\-l ^ ^(^fc-l-l)* 


Thus, we get

{Ua^k ^k-\-l 'dk^EijJk^k) B{x )) ^k+1 ^ ^ 

Combining this with (A.5), we obtain

^k 4^fc-l-l — Ex^k+l T 7fc('A^(f/6,fc) Bi^X ) A ^ ) ^k{,Xk ^k—\i ^k+1 ^ ) 'dk^k‘ 

Continuing as after (A.6) in the proof of Theorem 2.1, we obtain the key estimate

(^k+l < + <^fc + Ik^k + 'IkiCk^ Xk+l — X*) 

— + d,6k + Sk + 'y£k + v^7||^fc||-^(4fc+i, (A.18) 

where $\bar{\gamma} = 2\beta - \epsilon$, and $\theta_k$, $\delta_k$ and $v_k$ are as defined in (A.11). This yields

Sk+i < ^ {5j + JEj + v^tIIOIIVV^H-i)- 
(i) $a_k \in ]0, \bar{a}]$: summing up the last inequality, we get


~ '^k+l (V^l ^™+l) 

+ '^1 YL =1 


which entails 

‘^k+l < C+ \/27X]m=i "illCmllv^ < C + \/27^^t!l "lllCmIlVV’m+l, (A.19) 

where c = ^Pi + ^ 0. By assumption on the sequences 

(em)meN (^m)meN 5 c is bounded. Using the fact that (m||^m||)meN is summable, it can be 
easily shown, e.g. [6, Lemma A.9], that since the sequence {<fk)ken satisfies (A.19), it also obeys 
V’fc < \/c + EjeNJ'll^ill < +00- Denote t = ^+ EjeNJ'llCjll- (A.18) then becomes 

6*fc+i < “^IlffclP + + Sh + 7efc + V^7||Cfc||, 

which is of the form (A.12), where Sk is replaced by dk + J£k + \/27\/f||^fc||. and the latter is a 
summable sequence. With the same arguments as those after (A. 12) for Ofc g] 0, d], we deduce that ipk 
is convergent. 

(ii) ttfe = 0: in this case, (A.18) reduces to 

(f’fc+i < ^k + Sk + lek + V2-^\\ik II Vv^fc+i < 

Again, by virtue of [6, Lemma A.9] and summability of the sequences and (||Cj||)j£N> 

we have ifk <t = + EjeN(^i + + ll^ll) < +oo. Consequently, we have 

^k+i <‘fk + 4 +7efc + ^^711411- 

That is, the sequence $(x_k)_{k\in\mathbb{N}}$ is quasi-Fejér monotone (of type III) relative to $\mathrm{zer}(A + B)$, and thus
$\varphi_k$ is convergent.

In summary, for $a_k \in [0, \bar{a}]$, $\lim_{k\to\infty}\|x_k - x^\star\|$ exists for any $x^\star \in \mathrm{zer}(A + B)$.

The rest of the proof is patterned after the last part of the proof of Theorem 2.1, where we now use the fact
that $\epsilon_k \to 0$ by assumption, and that the graph of the enlargement $A^{\cdot} : \mathbb{R}_+ \times \mathcal{H} \rightrightarrows \mathcal{H}$ is weakly-strongly sequentially closed
in $\mathbb{R}_+ \times \mathcal{H} \times \mathcal{H}$ [57, Proposition 3.4(b)]. □


B Proofs of Section 4 

B.1 Riemannian Geometry

Let $\mathcal{M}$ be a $C^2$-smooth embedded submanifold of $\mathbb{R}^n$ around a point $x$. With some abuse of terminology,
we shall state $C^2$-manifold instead of $C^2$-smooth embedded submanifold of $\mathbb{R}^n$. The natural embedding of
a submanifold $\mathcal{M}$ into $\mathbb{R}^n$ permits to define a Riemannian structure and to introduce geodesics on $\mathcal{M}$, and
we simply say $\mathcal{M}$ is a Riemannian manifold. We denote respectively by $T_{\mathcal{M}}(x)$ and $N_{\mathcal{M}}(x)$ the tangent and
normal spaces of $\mathcal{M}$ at a point near $x$ in $\mathcal{M}$.

Exponential map Geodesics generalize the concept of straight lines in $\mathbb{R}^n$, preserving the zero acceleration
characteristic, to manifolds. Roughly speaking, a geodesic is locally the shortest path between two points on
$\mathcal{M}$. We denote by $g(t; x, h)$ the value at $t \in \mathbb{R}$ of the geodesic starting at $g(0; x, h) = x \in \mathcal{M}$ with initial velocity
$\dot{g}(0; x, h) = \frac{dg}{dt}(t; x, h)\big|_{t=0} = h \in T_{\mathcal{M}}(x)$ (which is uniquely defined). For every $h \in T_{\mathcal{M}}(x)$, there exists an
interval $I$ around 0 and a unique geodesic $g(t; x, h) : I \to \mathcal{M}$ such that $g(0; x, h) = x$ and $\dot{g}(0; x, h) = h$.
The mapping

$$\mathrm{Exp}_x : T_{\mathcal{M}}(x) \to \mathcal{M}, \quad h \mapsto \mathrm{Exp}_x(h) = g(1; x, h),$$

is called the exponential map. Given $x, x' \in \mathcal{M}$, the direction $h \in T_{\mathcal{M}}(x)$ we are interested in is such that

$$\mathrm{Exp}_x(h) = x' = g(1; x, h).$$
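As a concrete instance, the exponential map has a closed form on the unit sphere, where geodesics are great circles. The following sketch uses the unit sphere in $\mathbb{R}^3$ purely as an illustrative manifold (it is not one of the manifolds studied in the paper); the point `x` and direction `h` are hypothetical.

```python
import math

def exp_sphere(x, h):
    """Exponential map on the unit sphere: follow the great circle
    leaving x with initial velocity h (h must be tangent at x)."""
    nh = math.sqrt(sum(c * c for c in h))
    if nh == 0.0:
        return list(x)
    return [math.cos(nh) * xi + math.sin(nh) * hi / nh for xi, hi in zip(x, h)]

x = [0.0, 0.0, 1.0]
h = [0.3, -0.4, 0.0]          # tangent at x because <x, h> = 0
x_new = exp_sphere(x, h)      # the point g(1; x, h)

# Finite-difference estimate of the initial velocity of t -> Exp_x(t h)
t = 1e-6
x_t = exp_sphere(x, [t * c for c in h])
velocity = [(a - b) / t for a, b in zip(x_t, x)]
```

Here `x_new` stays on the sphere and `velocity` approximates `h`, matching the two defining properties $g(0; x, h) = x$ and $\dot{g}(0; x, h) = h$.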

Parallel translation Given two points $x, x' \in \mathcal{M}$, let $T_{\mathcal{M}}(x), T_{\mathcal{M}}(x')$ be their corresponding tangent
spaces. Define

$$\tau : T_{\mathcal{M}}(x) \to T_{\mathcal{M}}(x'),$$

the parallel translation along the unique geodesic joining $x$ to $x'$, which is an isomorphism and an isometry w.r.t.
the Riemannian metric.

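Riemannian gradient and Hessian For a vector $v \in N_{\mathcal{M}}(x)$, the Weingarten map of $\mathcal{M}$ at $x$ is the
operator $\mathfrak{W}_x(\cdot, v) : T_{\mathcal{M}}(x) \to T_{\mathcal{M}}(x)$ defined by

$$\mathfrak{W}_x(h, v) = -P_{T_{\mathcal{M}}(x)}\,\mathrm{d}V[h],$$

where $V$ is any local extension of $v$ to a normal vector field on $\mathcal{M}$. The definition is independent of the
choice of the extension $V$, and $\mathfrak{W}_x(\cdot, v)$ is a symmetric linear operator which is closely tied to the second
fundamental form of $\mathcal{M}$, see [20, Proposition II.2.1].

Let $G$ be a real-valued function which is $C^2$ along $\mathcal{M}$ around $x$. The covariant gradient of $G$ at $x' \in \mathcal{M}$
is the vector $\nabla_{\mathcal{M}}G(x') \in T_{\mathcal{M}}(x')$ defined by

$$\langle \nabla_{\mathcal{M}}G(x'),\ h\rangle = \frac{d}{dt}G\big(P_{\mathcal{M}}(x' + th)\big)\Big|_{t=0}, \quad \forall h \in T_{\mathcal{M}}(x'),$$

where $P_{\mathcal{M}}$ is the projection operator onto $\mathcal{M}$. The covariant Hessian of $G$ at $x'$ is the symmetric linear
mapping $\nabla^2_{\mathcal{M}}G(x')$ from $T_{\mathcal{M}}(x')$ to itself which is defined as

$$\langle \nabla^2_{\mathcal{M}}G(x')h,\ h\rangle = \frac{d^2}{dt^2}G\big(P_{\mathcal{M}}(x' + th)\big)\Big|_{t=0}, \quad \forall h \in T_{\mathcal{M}}(x'). \tag{B.1}$$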

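This definition agrees with the usual definition using geodesics or connections [42]. Now assume that $\mathcal{M}$
is a Riemannian embedded submanifold of $\mathbb{R}^n$, and that a function $G$ has a $C^2$-smooth restriction on $\mathcal{M}$.
This can be characterized by the existence of a $C^2$-smooth extension (representative) of $G$, i.e. a $C^2$-smooth
function $\widetilde{G}$ on $\mathbb{R}^n$ such that $\widetilde{G}$ agrees with $G$ on $\mathcal{M}$. Thus, the Riemannian gradient $\nabla_{\mathcal{M}}G(x')$ is also given
by

$$\nabla_{\mathcal{M}}G(x') = P_{T_{\mathcal{M}}(x')}\nabla\widetilde{G}(x'), \tag{B.2}$$

and $\forall h \in T_{\mathcal{M}}(x')$, the Riemannian Hessian reads

$$\begin{aligned} \nabla^2_{\mathcal{M}}G(x')h &= P_{T_{\mathcal{M}}(x')}\,\mathrm{d}(\nabla_{\mathcal{M}}G)(x')[h] = P_{T_{\mathcal{M}}(x')}\,\mathrm{d}\big(x' \mapsto P_{T_{\mathcal{M}}(x')}\nabla\widetilde{G}\big)[h] \\ &= P_{T_{\mathcal{M}}(x')}\nabla^2\widetilde{G}(x')h + \mathfrak{W}_{x'}\big(h,\ P_{N_{\mathcal{M}}(x')}\nabla\widetilde{G}(x')\big), \end{aligned} \tag{B.3}$$

where the last equality comes from [2, Theorem 1]. When $\mathcal{M}$ is an affine or linear subspace of $\mathbb{R}^n$, then
obviously $\mathcal{M} = x + T_{\mathcal{M}}(x)$, and $\mathfrak{W}_{x'}(h, P_{N_{\mathcal{M}}(x')}\nabla\widetilde{G}(x')) = 0$, hence (B.3) reduces to

$$\nabla^2_{\mathcal{M}}G(x') = P_{T_{\mathcal{M}}(x')}\nabla^2\widetilde{G}(x')P_{T_{\mathcal{M}}(x')}.$$

See [35, 20] for more materials on differential and Riemannian manifolds.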
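The formulas (B.1) to (B.3) can be sanity-checked on the unit sphere, used here only as an illustrative manifold. The sketch assumes the known sphere expression $\mathfrak{W}_x(h, v) = -\langle x, v\rangle h$ for the Weingarten map, a quadratic $\widetilde{G}(y) = \frac{1}{2}\langle y, Ay\rangle$ with a hypothetical symmetric matrix `A`, and compares the closed-form Riemannian gradient and Hessian with finite differences of $t \mapsto G(P_{\mathcal{M}}(x + th))$.

```python
import math

A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, -1.0]]           # hypothetical symmetric matrix

def matvec(M, v):
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def G(y):
    """Smooth extension G~(y) = 0.5 * <y, A y> on R^3."""
    return 0.5 * dot(y, matvec(A, y))

def proj_sphere(y):
    """Projection P_M onto the unit sphere."""
    n = math.sqrt(dot(y, y))
    return [yi / n for yi in y]

def proj_tangent(x, v):
    """Projection onto the tangent space {h : <x, h> = 0} at x."""
    c = dot(x, v)
    return [vi - c * xi for vi, xi in zip(v, x)]

def riem_grad(x):
    """(B.2): tangent projection of the Euclidean gradient A x."""
    return proj_tangent(x, matvec(A, x))

def riem_hess(x, h):
    """(B.3) on the sphere: P_T(A h) plus the Weingarten term -<x, A x> h."""
    c = dot(x, matvec(A, x))
    return [p - c * hi for p, hi in zip(proj_tangent(x, matvec(A, h)), h)]

x = proj_sphere([1.0, 2.0, 2.0])
h = proj_tangent(x, [0.5, -1.0, 0.25])

def phi(t):
    return G(proj_sphere([xi + t * hi for xi, hi in zip(x, h)]))

t = 1e-4
d1 = (phi(t) - phi(-t)) / (2 * t)                  # first derivative at t = 0
d2 = (phi(t) - 2 * phi(0.0) + phi(-t)) / (t * t)   # second derivative at t = 0, cf. (B.1)
```

Up to finite-difference error, `d1` matches `dot(riem_grad(x), h)` and `d2` matches `dot(riem_hess(x, h), h)`.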

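The following lemmas summarize two key properties that we will need throughout.

Lemma B.1. Let $x \in \mathcal{M}$, and $x_k$ a sequence converging to $x$ in $\mathcal{M}$. Denote by $\tau_k : T_{\mathcal{M}}(x) \to T_{\mathcal{M}}(x_k)$ the
parallel translation along the unique geodesic joining $x$ to $x_k$. Then, for any bounded vector $u \in \mathbb{R}^n$, we
have

$$(\tau_k^{-1}P_{T_{\mathcal{M}}(x_k)} - P_{T_{\mathcal{M}}(x)})u = o(\|u\|).$$

Proof. From [1, Chapter 5], we deduce that for $k$ sufficiently large,

$$\tau_k^{-1} = P_{T_{\mathcal{M}}(x)} + o(\|x_k - x\|).$$

In addition, locally near $x$ along $\mathcal{M}$, the operator $x \mapsto P_{T_{\mathcal{M}}(x)}$ is $C^1$, hence,

$$\begin{aligned} \lim_{k\to\infty}\frac{\|(\tau_k^{-1}P_{T_{\mathcal{M}}(x_k)} - P_{T_{\mathcal{M}}(x)})u\|}{\|u\|} &\le \lim_{k\to\infty}\frac{\|P_{T_{\mathcal{M}}(x)}(P_{T_{\mathcal{M}}(x_k)} - P_{T_{\mathcal{M}}(x)})\|\,\|u\|}{\|u\|} + o(\|x_k - x\|) \\ &\le \lim_{k\to\infty}\|P_{T_{\mathcal{M}}(x_k)} - P_{T_{\mathcal{M}}(x)}\| + o(\|x_k - x\|) = 0. \end{aligned}$$

□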


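Lemma B.2. Let $x, x'$ be two close points in $\mathcal{M}$, denote $\tau : T_{\mathcal{M}}(x) \to T_{\mathcal{M}}(x')$ the parallel translation along
the unique geodesic joining $x$ to $x'$. The Riemannian Taylor expansion of $\Phi \in C^2(\mathcal{M})$ around $x$ reads,

$$\tau^{-1}\nabla_{\mathcal{M}}\Phi(x') = \nabla_{\mathcal{M}}\Phi(x) + \nabla^2_{\mathcal{M}}\Phi(x)P_{T_{\mathcal{M}}(x)}(x' - x) + o(\|x' - x\|).$$

Proof. Since $x, x' \in \mathcal{M}$ are close, we have $x' = \mathrm{Exp}_x(h)$ for some $h \in T_{\mathcal{M}}(x)$ small enough, and thus, the
Taylor expansion [56, Remark 4.2] of $\nabla_{\mathcal{M}}\Phi$ around $x$ reads

$$\tau^{-1}\nabla_{\mathcal{M}}\Phi(x') = \nabla_{\mathcal{M}}\Phi(x) + \nabla^2_{\mathcal{M}}\Phi(x)h + o(\|h\|). \tag{B.4}$$

Moreover, from the proof of [42, Theorem 4.9], one can show that

$$P_{T_{\mathcal{M}}(x)}(x') = P_{T_{\mathcal{M}}(x)}(\mathrm{Exp}_x(h)) = P_{T_{\mathcal{M}}(x)}(x) + h + o(\|h\|^2).$$

Substituting back into (B.4) we get the claimed result. □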


B.2 Proofs

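Proof of Proposition 4.1.

(i) Since $F$ is locally $C^2$ around $x^\star$, there exists $\epsilon > 0$ sufficiently small such that for any $\delta \in \mathbb{B}_\epsilon(0)$, we
have

$$\begin{aligned} \Phi(x^\star + \delta) - \Phi(x^\star) &= F(x^\star + \delta) - F(x^\star) - \langle\nabla F(x^\star), \delta\rangle + R(x^\star + \delta) - R(x^\star) + \langle\nabla F(x^\star), \delta\rangle \\ &= \tfrac{1}{2}\langle\delta,\ \nabla^2F(x^\star + t\delta)\delta\rangle + R(x^\star + \delta) - R(x^\star) + \langle\nabla F(x^\star), \delta\rangle, \quad t \in ]0, 1[. \end{aligned}$$

Let $x_t = x^\star + t\delta \in \mathbb{B}_\epsilon(x^\star)$. Since (RI) holds and $\nabla^2F(x)$ depends continuously on $x \in \mathbb{B}_\epsilon(x^\star)$, we
have $P_{T_{x^\star}}\nabla^2F(x)P_{T_{x^\star}} \succeq \alpha\,\mathrm{Id}$ for any such $x$. This holds in particular at $x_t$. We then distinguish two
cases.

(a) $\delta \notin \ker(\nabla^2F(x_t))$. In this case, it is clear that

$$\Phi(x^\star + \delta) - \Phi(x^\star) \ge \tfrac{1}{2}\langle\delta,\ \nabla^2F(x_t)\delta\rangle \ge \tfrac{\alpha}{2}\|\delta\|^2 > 0$$

since $F$ is convex and locally $C^2$, and $R$ is convex with $-\nabla F(x^\star) \in \partial R(x^\star)$.

(b) $\delta \in \ker(\nabla^2F(x_t)) \setminus \{0\}$. Since $R$ is a proper closed convex function, it is sub-differentially
regular at $x^\star$. Moreover $\partial R(x^\star) \ne \emptyset$ ($-\nabla F(x^\star)$ is in it), and thus the directional derivative
$R'(x^\star; \cdot)$ is proper and closed, and it is the support of $\partial R(x^\star)$ [54, Theorem 8.30]. It then follows
from the separation theorem [32, Theorem V.2.2.3] that

$$-\nabla F(x^\star) \in \mathrm{ri}(\partial R(x^\star)) \iff R'(x^\star; \delta) > -\langle\nabla F(x^\star), \delta\rangle, \quad \forall\delta \text{ s.t. } R'(x^\star; \delta) + R'(x^\star; -\delta) > 0.$$

As $\ker(R'(x^\star; \cdot)) = T_{x^\star}$ [61, Proposition 3(iii) and Lemma 10], and in view of (RI), we get

$$\begin{aligned} -\nabla F(x^\star) \in \mathrm{ri}(\partial R(x^\star)) &\iff R'(x^\star; \delta) > -\langle\nabla F(x^\star), \delta\rangle, \quad \forall\delta \notin T_{x^\star} \\ &\implies R'(x^\star; \delta) > -\langle\nabla F(x^\star), \delta\rangle, \quad \forall\delta \in \ker(\nabla^2F(x_t)) \setminus \{0\}. \end{aligned}$$

Combining this with classical properties of the directional derivative of a convex function yields

$$\begin{aligned} \Phi(x^\star + \delta) - \Phi(x^\star) &= R(x^\star + \delta) - R(x^\star) + \langle\nabla F(x^\star), \delta\rangle \\ &\ge R'(x^\star; \delta) + \langle\nabla F(x^\star), \delta\rangle > 0, \end{aligned}$$

which concludes the first claim.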

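(ii) Let $\Psi$ be as defined in the proof of Lemma 4.3. If $R \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$, the Riemannian Hessian of $\Phi$ reads

$$\nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star) = P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}} + \nabla^2_{\mathcal{M}_{x^\star}}\Psi(x^\star).$$

In view of Lemma 4.3(i), $\nabla^2_{\mathcal{M}_{x^\star}}\Psi(x^\star)$ is positive semi-definite on $T_{x^\star}$. On the other hand, hypothesis
(RI) entails positive definiteness of $P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}$. Altogether, this shows that $\nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)$ is
positive definite on $T_{x^\star} \setminus \{0\}$. Local quadratic growth of $\Phi$ near $x^\star$ then follows by combining [37,
Definition 5.4], [42, Theorem 3.4] and [30, Theorem 6.2].

□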


Proof of Lemma 4.3. By definition of $U$, $Uh = 0$ for any $h \in T_{x^\star}^\perp$. Thus, in the following we only examine
the case $h \in T_{x^\star}$.

(i) Let $\Psi(x) = R(x) + \langle x, \nabla F(x^\star)\rangle$. From the smooth perturbation rule of partial smoothness [37,
Corollary 4.7], $\Psi \in \mathrm{PSF}_{x^\star}(\mathcal{M}_{x^\star})$. Moreover, from Fact 3.3 and normal sharpness, the Riemannian
Hessian of $\Psi$ at $x^\star$ is such that, $\forall h \in T_{x^\star}$,
7VL * = 7Pt,* + 72B:.* (/i, P^r V$(a;*)) 

X 25* 

= -fPT,,v^R{x*)h + *rWx* {h, Pyx V$(x*)) 

X* 

= Hh = Uh, 

where ~ is the smooth representative of the corresponding function. 

Since — VF(x*) G n{dR{x*)), we have from [38, Corollary 5.4] that 


d^R{x*\-VF{x*))h 


+ h^Tx*, 

0, h ^ Tx*, 
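where $\partial^2R(x^\star | -\nabla F(x^\star))$ denotes the Mordukhovich generalized Hessian mapping of function $R$ at
$(x^\star, -\nabla F(x^\star)) \in \mathrm{gph}(\partial R)$ [43]. As $R \in \Gamma_0(\mathbb{R}^n)$, $\partial R$ is a maximal monotone operator, and in view
of [50, Theorem 2.1] we have that the mapping $\partial^2R(x^\star | -\nabla F(x^\star))$ is positive semi-definite, whence
we conclude that $\forall h \in T_{x^\star}$,

$$0 \le \gamma\langle\partial^2R(x^\star | -\nabla F(x^\star))h,\ h\rangle = \gamma\langle\nabla^2_{\mathcal{M}_{x^\star}}\Psi(x^\star)h,\ h\rangle = \langle Uh,\ h\rangle.$$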

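(ii) In this case, $U = \gamma P_{T_{x^\star}}\nabla^2\widetilde{R}(x^\star)P_{T_{x^\star}}$. Let $x_t = x^\star + th$, for any scalar $t > 0$ and $h \in T_{x^\star}$.
Obviously, $x_t \in x^\star + T_{x^\star} = \mathcal{M}_{x^\star}$, and for $t$ sufficiently small, by Fact 3.2, $T_{x_t} = T_{x^\star}$. Thus,
$\forall u \in \partial R(x^\star)$ and $\forall v \in \partial R(x_t)$

$$\begin{aligned} 0 \le t^{-2}\langle v - u,\ x_t - x^\star\rangle &= t^{-1}\langle v - u,\ h\rangle \\ &= t^{-1}\langle P_{T_{x^\star}}(v - u),\ h\rangle \\ &= t^{-1}\langle P_{T_{x_t}}v - P_{T_{x^\star}}u,\ h\rangle \\ \text{(by Fact 3.3)}\quad &= t^{-1}\langle\nabla_{\mathcal{M}_{x^\star}}R(x_t) - \nabla_{\mathcal{M}_{x^\star}}R(x^\star),\ h\rangle \\ \text{(by (B.2))}\quad &= \langle t^{-1}P_{T_{x^\star}}\big(\nabla\widetilde{R}(x^\star + tP_{T_{x^\star}}h) - \nabla\widetilde{R}(x^\star)\big),\ h\rangle. \end{aligned}$$

Since $\widetilde{R}$ is $C^2$, passing to the limit as $t \to 0$ leads to the desired result. □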

Proof of Lemma 4.4.

(i) (a) is proved using the assumptions and Rademacher theorem, (b) and (c) follow from simple linear
algebra arguments.

(ii) From Lemma 4.3, we have $WG = W^{1/2}(W^{1/2}GW^{1/2})W^{-1/2}$, meaning that $WG$ is similar to $W^{1/2}GW^{1/2}$.
The latter is symmetric and obeys

$$\|W^{1/2}GW^{1/2}\| \le \|W^{1/2}\|\,\|G\|\,\|W^{1/2}\| < 1,$$

where we used (i)-(b) to get the last inequality. Thus $W^{1/2}GW^{1/2}$ has real eigenvalues in $]-1, 1[$,
and so does $WG$ by similarity. The last statement follows using (i)-(c). □
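The similarity argument can be illustrated on a $2 \times 2$ example. The sketch below assumes a hypothetical diagonal $W$ with entries in $]0, 1]$ and a hypothetical symmetric $G$ with $\|G\| < 1$; it computes the eigenvalues of the non-symmetric product $WG$ and of the symmetric matrix $W^{1/2}GW^{1/2}$ from their traces and determinants.

```python
import math

def eig2(m):
    """Eigenvalues of a 2x2 matrix [[p, q], [r, s]], assuming they are real."""
    (p, q), (r, s) = m
    tr, det = p + s, p * s - q * r
    root = math.sqrt(tr * tr - 4.0 * det)   # ValueError if eigenvalues are complex
    return sorted(((tr - root) / 2.0, (tr + root) / 2.0))

w = (0.5, 0.9)                      # hypothetical diagonal of W, entries in ]0, 1]
G = [[0.8, 0.3], [0.3, -0.6]]       # hypothetical symmetric G with norm < 1

WG = [[w[i] * G[i][j] for j in range(2)] for i in range(2)]
sw = (math.sqrt(w[0]), math.sqrt(w[1]))
S = [[sw[i] * G[i][j] * sw[j] for j in range(2)] for i in range(2)]   # W^{1/2} G W^{1/2}

eig_WG = eig2(WG)
eig_S = eig2(S)
```

Both lists coincide and lie in $]-1, 1[$, even though `WG` itself is not symmetric.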

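We define the iteration-dependent versions of the matrices in (4.2), i.e.

$$\begin{aligned} H_k &= \gamma_kP_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}, \quad G_k = \mathrm{Id} - H_k, \quad U_k = \gamma_k\nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)P_{T_{x^\star}} - H_k, \\ M_{k,1} &= \big[(1 + b)W(G_k - G),\ -bW(G_k - G)\big], \\ M_{k,2} &= \big[((a_k - b_k) - (a - b))W + (b_k - b)WG_k,\ -((a_k - b_k) - (a - b))W - (b_k - b)WG_k\big]. \end{aligned} \tag{B.5}$$

After the finite identification of $\mathcal{M}_{x^\star}$, we have $x_k \in \mathcal{M}_{x^\star}$ for $x_k$ close enough to $x^\star$. Let $T_{x_k}$ be their
corresponding tangent spaces, and define $\tau_k : T_{x^\star} \to T_{x_k}$ the parallel translation along the unique geodesic
joining $x_k$ to $x^\star$.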

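Before proving Proposition 4.5, we first establish the following intermediate result which provides useful
estimates.

Proposition B.3. Under the assumptions of Proposition 4.5, we have

$$\begin{gathered} \|y_{a,k} - x^\star\| = O(\|d_k\|), \quad \|y_{b,k} - x^\star\| = O(\|d_k\|), \quad \|r_{k+1}\| = O(\|d_k\|), \\ (\tau_{k+1}^{-1}P_{T_{x_{k+1}}} - P_{T_{x^\star}})(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) = o(\|d_k\|), \end{gathered} \tag{B.6}$$

and

$$\|W(U_k - U)r_{k+1}\| = o(\|d_k\|), \quad \|M_{k,1}d_k\| = o(\|d_k\|) \quad \text{and} \quad \|M_{k,2}d_k\| = o(\|d_k\|). \tag{B.7}$$

Proof. We have

$$\begin{aligned} \|y_{a,k} - x^\star\| = \|(1 + a_k)r_k - a_kr_{k-1}\| &\le (1 + a_k)\|r_k\| + a_k\|r_{k-1}\| \\ &\le (1 + a_k)(\|r_k\| + \|r_{k-1}\|) \le \sqrt{2}(1 + a_k)\|d_k\|, \end{aligned} \tag{B.8}$$

whence we get the first and second estimates. In turn, we obtain

$$\begin{aligned} \|r_{k+1}\| &= \|\mathrm{prox}_{\gamma_kR}(y_{a,k} - \gamma_k\nabla F(y_{b,k})) - \mathrm{prox}_{\gamma_kR}(x^\star - \gamma_k\nabla F(x^\star))\| \\ &\le \|(y_{a,k} - x^\star) - \gamma_k(\nabla F(y_{b,k}) - \nabla F(x^\star))\| \\ &\le \|(1 + a_k)r_k - a_kr_{k-1}\| + \tfrac{\gamma_k}{\beta}\|(1 + b_k)r_k - b_kr_{k-1}\| \\ &\le (1 + a_k)\|r_k\| + a_k\|r_{k-1}\| + (1 + b_k)\tfrac{\gamma_k}{\beta}\|r_k\| + \tfrac{b_k\gamma_k}{\beta}\|r_{k-1}\| \\ &\le \big((1 + a_k) + (1 + b_k)\tfrac{\gamma_k}{\beta}\big)(\|r_k\| + \|r_{k-1}\|) \\ &\le \big((1 + a_k) + (1 + b_k)\tfrac{\gamma_k}{\beta}\big)\sqrt{2}\|d_k\|, \end{aligned} \tag{B.9}$$

where we used non-expansiveness of the proximity operator and assumption (H.2). This yields the third
estimate. Combining Lemma B.1, assumption (H.2), (B.8) and (B.9), we get

$$\begin{aligned} (\tau_{k+1}^{-1}P_{T_{x_{k+1}}} - P_{T_{x^\star}})(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) &= o(\|\nabla F(y_{b,k}) - \nabla F(x_{k+1})\|) \\ &= o(\|y_{b,k} - x^\star\|) + o(\|r_{k+1}\|) = o(\|d_k\|). \end{aligned}$$

Let us now turn to (B.7). Recall the function $\Psi$ defined in the proof of Lemma 4.3(i). First, we have

$$\lim_{k\to\infty}\frac{\|W(U_k - U)r_{k+1}\|}{\|r_{k+1}\|} = \lim_{k\to\infty}\frac{\|W(\gamma_k - \gamma)\nabla^2_{\mathcal{M}_{x^\star}}\Psi(x^\star)P_{T_{x^\star}}r_{k+1}\|}{\|r_{k+1}\|} \le \lim_{k\to\infty}|\gamma_k - \gamma|\,\|W\|\,\|\nabla^2_{\mathcal{M}_{x^\star}}\Psi(x^\star)P_{T_{x^\star}}\| = 0,$$

which entails $\|W(U_k - U)r_{k+1}\| = o(\|r_{k+1}\|) = o(\|d_k\|)$. Again, since $\gamma_k \to \gamma$,

$$\begin{aligned} \lim_{k\to\infty}\frac{\|M_{k,1}d_k\|}{\|d_k\|} &= \lim_{k\to\infty}\frac{\|(1 + b)W(G_k - G)r_k - bW(G_k - G)r_{k-1}\|}{\|d_k\|} \\ &\le \lim_{k\to\infty}\frac{(1 + b)\|W\|\,\|G_k - G\|(\|r_k\| + \|r_{k-1}\|)}{\|d_k\|} \\ &\le \lim_{k\to\infty}\frac{(1 + b)\|W\|\,|\gamma_k - \gamma|\,\|P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}\|\sqrt{2}\|d_k\|}{\|d_k\|} \\ &= \lim_{k\to\infty}\sqrt{2}|\gamma_k - \gamma|\big((1 + b)\|W\|\,\|P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}\|\big) = 0, \end{aligned}$$

as $(1 + b)\|W\|\,\|P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}\|$ is obviously bounded (by $2/\beta$). Similarly, for $M_{k,2}$, since $a_k \to a$, $b_k \to b$,

$$\begin{aligned} \lim_{k\to\infty}\frac{\|M_{k,2}d_k\|}{\|d_k\|} &= \lim_{k\to\infty}\frac{\|\big(((a_k - b_k) - (a - b))W + (b_k - b)WG_k\big)(r_k - r_{k-1})\|}{\|d_k\|} \\ &\le \lim_{k\to\infty}\frac{(|a_k - a| + |b_k - b|)\|(W + WG_k)(r_k - r_{k-1})\|}{\|d_k\|} \\ &\le \lim_{k\to\infty}\frac{(|a_k - a| + |b_k - b|)\|W(\mathrm{Id} + G_k)\|\,\|r_k - r_{k-1}\|}{\|d_k\|} \\ &\le \lim_{k\to\infty}\frac{(|a_k - a| + |b_k - b|)\|W(\mathrm{Id} + G_k)\|\sqrt{2}\|d_k\|}{\|d_k\|} \\ &= \lim_{k\to\infty}\sqrt{2}(|a_k - a| + |b_k - b|)\|W(\mathrm{Id} + G_k)\| = 0, \end{aligned}$$

where $W$, $G_k$ are bounded.

□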

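Proof of Proposition 4.5. (1.3) and the first-order optimality condition for problem $(\mathcal{P}_{\mathrm{opt}})$ are respectively
equivalent to

$$\begin{aligned} y_{a,k} - x_{k+1} - \gamma_k(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) &\in \gamma_k\partial\Phi(x_{k+1}), \\ 0 &\in \gamma_k\partial\Phi(x^\star). \end{aligned}$$

Projecting into $T_{x_{k+1}}$ and $T_{x^\star}$, respectively, and using Fact 3.3, leads to

$$\begin{aligned} \gamma_k\tau_{k+1}^{-1}\nabla_{\mathcal{M}_{x^\star}}\Phi(x_{k+1}) &= \tau_{k+1}^{-1}P_{T_{x_{k+1}}}\big(y_{a,k} - x_{k+1} - \gamma_k(\nabla F(y_{b,k}) - \nabla F(x_{k+1}))\big), \\ \gamma_k\nabla_{\mathcal{M}_{x^\star}}\Phi(x^\star) &= 0. \end{aligned}$$

Adding both identities, and subtracting $\tau_{k+1}^{-1}P_{T_{x_{k+1}}}x^\star$ on both sides, we arrive at

$$\begin{aligned} &\tau_{k+1}^{-1}P_{T_{x_{k+1}}}r_{k+1} + \gamma_k\big(\tau_{k+1}^{-1}\nabla_{\mathcal{M}_{x^\star}}\Phi(x_{k+1}) - \nabla_{\mathcal{M}_{x^\star}}\Phi(x^\star)\big) \\ &\quad = \tau_{k+1}^{-1}P_{T_{x_{k+1}}}(y_{a,k} - x^\star) - \gamma_k\tau_{k+1}^{-1}P_{T_{x_{k+1}}}(\nabla F(y_{b,k}) - \nabla F(x_{k+1})). \end{aligned} \tag{B.10}$$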

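By virtue of Lemma B.1, we get

$$\tau_{k+1}^{-1}P_{T_{x_{k+1}}}r_{k+1} = P_{T_{x^\star}}r_{k+1} + (\tau_{k+1}^{-1}P_{T_{x_{k+1}}} - P_{T_{x^\star}})r_{k+1} = P_{T_{x^\star}}r_{k+1} + o(\|r_{k+1}\|).$$

Using [39, Lemma 5.1], we also have

$$r_{k+1} = P_{T_{x^\star}}r_{k+1} + o(\|r_{k+1}\|),$$

and thus

$$\tau_{k+1}^{-1}P_{T_{x_{k+1}}}r_{k+1} = r_{k+1} + o(\|r_{k+1}\|) = r_{k+1} + o(\|d_k\|), \tag{B.11}$$

where we also used (B.6). Similarly

$$\begin{aligned} \tau_{k+1}^{-1}P_{T_{x_{k+1}}}(y_{a,k} - x^\star) &= P_{T_{x^\star}}(y_{a,k} - x^\star) + o(\|y_{a,k} - x^\star\|) \\ &= P_{T_{x^\star}}(y_{a,k} - x^\star) + o(\|d_k\|) \\ &= (1 + a_k)P_{T_{x^\star}}r_k - a_kP_{T_{x^\star}}r_{k-1} + o(\|d_k\|) \\ &= (1 + a_k)r_k - a_kr_{k-1} + o(\|r_k\|) + o(\|r_{k-1}\|) + o(\|d_k\|) \\ &= (y_{a,k} - x^\star) + o(\|d_k\|). \end{aligned} \tag{B.12}$$

Moreover owing to Lemma B.2 and (B.6),

$$\begin{aligned} \tau_{k+1}^{-1}\nabla_{\mathcal{M}_{x^\star}}\Phi(x_{k+1}) - \nabla_{\mathcal{M}_{x^\star}}\Phi(x^\star) &= \nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)P_{T_{x^\star}}r_{k+1} + o(\|r_{k+1}\|) \\ &= \nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)P_{T_{x^\star}}r_{k+1} + o(\|d_k\|). \end{aligned} \tag{B.13}$$

Therefore, inserting (B.11), (B.12) and (B.13) into (B.10), we obtain

$$(\mathrm{Id} + \gamma_k\nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)P_{T_{x^\star}})r_{k+1} = (y_{a,k} - x^\star) - \gamma_k\tau_{k+1}^{-1}P_{T_{x_{k+1}}}(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) + o(\|d_k\|). \tag{B.14}$$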

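Owing to (B.6) and local $C^2$-smoothness of $F$, we have

$$\begin{aligned} &\tau_{k+1}^{-1}P_{T_{x_{k+1}}}(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) \\ &= P_{T_{x^\star}}(\nabla F(y_{b,k}) - \nabla F(x_{k+1})) + o(\|d_k\|) \\ &= P_{T_{x^\star}}(\nabla F(y_{b,k}) - \nabla F(x^\star)) - P_{T_{x^\star}}(\nabla F(x_{k+1}) - \nabla F(x^\star)) + o(\|d_k\|) \\ &= P_{T_{x^\star}}\nabla^2F(x^\star)(y_{b,k} - x^\star) + o(\|y_{b,k} - x^\star\|) - P_{T_{x^\star}}\nabla^2F(x^\star)r_{k+1} + o(\|r_{k+1}\|) + o(\|d_k\|) \\ &= P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}(y_{b,k} - x^\star) - P_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}}(x_{k+1} - x^\star) + o(\|d_k\|). \end{aligned} \tag{B.15}$$

Injecting (B.15) in (B.14), we get

$$\begin{aligned} &(\mathrm{Id} + \gamma_k\nabla^2_{\mathcal{M}_{x^\star}}\Phi(x^\star)P_{T_{x^\star}} - \gamma_kP_{T_{x^\star}}\nabla^2F(x^\star)P_{T_{x^\star}})r_{k+1} \\ &= (\mathrm{Id} + U_k)r_{k+1} = (y_{a,k} - x^\star) - H_k(y_{b,k} - x^\star) + o(\|d_k\|), \end{aligned} \tag{B.16}$$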
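which can be further written as,

$$\begin{aligned} (\mathrm{Id} + U_k)r_{k+1} &= (\mathrm{Id} + U)r_{k+1} + (U_k - U)r_{k+1} = (y_{a,k} - x^\star) - H_k(y_{b,k} - x^\star) + o(\|d_k\|) \\ &= ((1 + a_k)r_k - a_kr_{k-1}) - H_k((1 + b_k)r_k - b_kr_{k-1}) + o(\|d_k\|) \\ &= ((1 + a_k)r_k - (1 + b_k)H_kr_k) - (a_kr_{k-1} - b_kH_kr_{k-1}) + o(\|d_k\|) \\ &= ((a_k - b_k)\mathrm{Id} + (1 + b_k)G_k)r_k - ((a_k - b_k)\mathrm{Id} + b_kG_k)r_{k-1} + o(\|d_k\|) \\ &= \big[(a_k - b_k)\mathrm{Id} + (1 + b_k)G_k,\ -((a_k - b_k)\mathrm{Id} + b_kG_k)\big]d_k + o(\|d_k\|). \end{aligned}$$

Inverting $\mathrm{Id} + U$ (which is possible thanks to Lemma 4.3), we obtain

$$r_{k+1} + W(U_k - U)r_{k+1} = \big[(a_k - b_k)W + (1 + b_k)WG_k,\ -(a_k - b_k)W - b_kWG_k\big]d_k + o(\|d_k\|).$$

Using the estimates (B.7), we get

$$\begin{aligned} d_{k+1} &= \begin{bmatrix} (a_k - b_k)W + (1 + b_k)WG_k & -(a_k - b_k)W - b_kWG_k \\ \mathrm{Id} & 0 \end{bmatrix}d_k + o(\|d_k\|) \\ &= \Big(M + \begin{bmatrix} M_{k,1} \\ 0 \end{bmatrix} + \begin{bmatrix} M_{k,2} \\ 0 \end{bmatrix}\Big)d_k + o(\|d_k\|) = Md_k + o(\|d_k\|). \end{aligned}$$

□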


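Proof of Proposition 4.7.

(i) We have

$$\begin{bmatrix} (a - b)\mathrm{Id} + (1 + b)G & -(a - b)\mathrm{Id} - bG \\ \mathrm{Id} & 0 \end{bmatrix}\begin{pmatrix} r_1 \\ r_2 \end{pmatrix} = \begin{pmatrix} (a - b)r_1 + (1 + b)Gr_1 - (a - b)r_2 - bGr_2 \\ r_1 \end{pmatrix} = \sigma\begin{pmatrix} r_1 \\ r_2 \end{pmatrix},$$

and thus $r_1 = \sigma r_2$. Inserting this in the first identity, we obtain

$$\begin{aligned} \sigma^2r_2 &= (a - b)\sigma r_2 + (1 + b)\sigma Gr_2 - (a - b)r_2 - bGr_2 \\ \implies Gr_2 &= \frac{\sigma^2 - (a - b)\sigma + (a - b)}{(1 + b)\sigma - b}r_2 = \eta r_2 \\ \implies 0 &= \sigma^2 - ((a - b) + (1 + b)\eta)\sigma + (a - b) + b\eta. \end{aligned}$$

(ii) For this quadratic equation of $\sigma$, the two roots are

$$\sigma_1 = \frac{((a - b) + (1 + b)\eta) + \sqrt{\Delta_\sigma}}{2}, \quad \sigma_2 = \frac{((a - b) + (1 + b)\eta) - \sqrt{\Delta_\sigma}}{2},$$

where $\Delta_\sigma$ is the discriminant

$$\Delta_\sigma = ((a - b) + (1 + b)\eta)^2 - 4((a - b) + b\eta),$$

which is a quadratic function of 3 variables. Consider the following 3 linear functions of $b$

$$\begin{aligned} a_1 &= (1 - \eta)b - \eta, \\ a_2 &= (1 - \eta)b + (1 - \sqrt{1 - \eta})^2, \\ a_3 &= (1 - \eta)b - \tfrac{1 + \eta}{2}, \end{aligned}$$

for which

$$\begin{aligned} \Delta_\sigma \le 0 &: \ a_2 \le a < 1 \le (1 - \eta)b + (1 + \sqrt{1 - \eta})^2, \\ \Delta_\sigma \ge 0 &: \ a \le a_2. \end{aligned} \tag{B.18}$$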


Recall from Lemma 4.4(i) that $\eta \in ]-1, 1[$. Thus, $a_1 \ge a_2$ when $\eta \in ]-1, 0]$, $a_1 \le a_2$ when $\eta \in [0, 1[$,
and $a_3$ is smaller than both $a_1, a_2$ independently of $\eta$. We now discuss each case.

Case $\eta \in ]-1, 0]$: We have $a_1 \ge a_2$,

• Subcase $a \in [a_2, 1[$: $\sigma_{1,2}$ are complex, hence

$$|\sigma|^2 = \frac{((a - b) + (1 + b)\eta)^2 - \big(((a - b) + (1 + b)\eta)^2 - 4((a - b) + b\eta)\big)}{4} = (a - b) + b\eta. \tag{B.19}$$

Since $a_2 \le a < 1$ and $b \ge 0$, we have $(1 - \sqrt{1 - \eta})^2 \le |\sigma|^2 < 1 + (\eta - 1)b \le 1$.

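• Subcase $a \in [0, a_2]$: $\Delta_\sigma \ge 0$ and $\sigma_2$ has the bigger absolute value, then

$$\begin{aligned} |\sigma_2| \le 1 &\iff -((a - b) + (1 + b)\eta) + \sqrt{\Delta_\sigma} \le 2 \\ &\iff \Delta_\sigma \le 4 + 4((a - b) + (1 + b)\eta) + ((a - b) + (1 + b)\eta)^2 \\ &\iff \frac{2(b - a) - 1}{1 + 2b} \le \eta, \end{aligned} \tag{B.20}$$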

which means that \a 2 \ < 1 for o G [ 03 , 02 ], and 10-21 > 1 for o G [0, 03 ]. Moreover, 03 < 0 for 
b G [0, meaning that if 77 > |, then |o- 2 | < 1 for o G [0, 02 ]. 


Case $\eta \in [0, 1[$: First we have $a_2 \ge a_1$, and moreover

$$a_1 = 0 \iff b = \frac{\eta}{1 - \eta}\ \begin{cases} \le 1 : \eta \in [0, 0.5], \\ \ge 1 : \eta \in [0.5, 1[. \end{cases}$$

Obviously, we have that $|\sigma| < 1$ holds for any $a \in [0, a_2]$ as long as $\eta \in [0.5, 1[$, though this situation is
useless as $b \in [0, 1]$. In the subcases hereafter, we only consider $\eta \in [0, 0.5]$.

• Subcase $a \in [a_2, 1[$: same result as (B.19).

• Subcase $a \in [a_1, a_2]$: $\sigma_1 \ge |\sigma_2|$, hence

$$\begin{aligned} \sigma_1 \le 1 &\iff ((a - b) + (1 + b)\eta) + \sqrt{\Delta_\sigma} \le 2 \\ &\iff \Delta_\sigma \le 4 - 4((a - b) + (1 + b)\eta) + ((a - b) + (1 + b)\eta)^2 \\ &\iff 0 \le 4(1 - \eta). \end{aligned} \tag{B.21}$$

• Subcase $a \in [0, a_1]$: We have $|\sigma_2| \ge |\sigma_1|$, hence (B.20) applies and the result follows.

Summarizing this discussion yields the claimed result. □
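The reduction in part (i) can be checked numerically in the scalar case. The sketch below assumes $G$ acts as multiplication by a single eigenvalue $\eta$, so that $M$ reduces to a $2 \times 2$ matrix; the values of `a`, `b` and `eta` are hypothetical. It verifies that each root $\sigma$ of the quadratic is an eigenvalue with eigenvector $(\sigma, 1)$, and that $|\sigma|^2 = (a - b) + b\eta$ in the complex case, as in (B.19).

```python
import cmath

def char_roots(a, b, eta):
    """Roots of sigma^2 - ((a-b) + (1+b)*eta)*sigma + (a-b) + b*eta = 0."""
    s = (a - b) + (1 + b) * eta
    c = (a - b) + b * eta
    root = cmath.sqrt(s * s - 4 * c)
    return (s + root) / 2, (s - root) / 2

def apply_M(a, b, eta, r1, r2):
    """Action of M on (r1, r2) when G is the scalar eta."""
    top = (a - b) * r1 + (1 + b) * eta * r1 - (a - b) * r2 - b * eta * r2
    return top, r1

a, b, eta = 0.8, 0.6, 0.5     # hypothetical parameters with a >= a_2 (complex roots)
residuals = []
for sigma in char_roots(a, b, eta):
    r1, r2 = sigma, 1.0       # eigenvector satisfying r1 = sigma * r2
    top, bottom = apply_M(a, b, eta, r1, r2)
    residuals.append(abs(top - sigma * r1) + abs(bottom - sigma * r2))

sigma1 = char_roots(a, b, eta)[0]
modulus_sq = abs(sigma1) ** 2
```

For these values the roots are complex conjugates and `modulus_sq` equals $(a - b) + b\eta = 0.5$, so the spectral radius is below 1.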

Proof of Theorem 4.15. Since R is locally polyhedral, then V_a 4 ^* is locally constant along M.x* = 
X* + Tx* around x*. Thus, embarking from (B.16) in the proof of Proposition 4.5, for k large enough, we get 


^ ^ ) ^k{yb,k ^ ), 

where we used the mean-value theorem with = 7 ^ V‘^F(x* + — ®*))df ^ 0. Using that is 

symmetric and Im(£'fc)^ = V, we have 


Py(a;fc+i - X*) = Pv(ya,k - x*) = (1 -f ak)Pv(xk - x*) - akPv(xk-i - x*). 


If Ofc = 0, then Pv{xk+i — x*) = Pv{xk — x*). Thus, in the rest, without loss of generality, we assume that 
Ofc > 0 for k large enough. The above iteration leads to 


/Py(a:fc+i - a:*)'\ 

V ^v{xk - X*) ) 


It is straightforward to check that = 


(1 -f afc)Id 

-Ufcld 

Id 

0 

(1 -f afc)Id 

-Ufcld 

Id 

On 


( Py(a;fc - a:*) '\ 

VPy(2;fc_i - X*)/ ■ 


is invertible and admits two eigenvalues 


Ofc > 0 and 1 respectively. Iterating the above argument, and owing to the fact that Xk,ya,ki yb,k 
get 



Py(Xfc - X* 
Pv{xk-i - X 



X*, we 


and n^fc invertible. Therefore, we obtain that Xk — x* G V^, and in turn, Ua^k — s* G and 
yb,k — a:* G V^, for all large enough k. Observe that C Tx*, it then follows that 


Xk-\-l X ya,k X Py^f^fcPy^ X ). 


By definition, Pyrif^Pyx is symmetric positive definite. Thus, replacing by PyxE'^Pyx, G and M 
accordingly, in Lemma 4.4 and Corollary 4.9, and applying Theorem 4.11 leads to the result. □ 